Postnatal environmental exposures, particularly those found in household products and dietary intake, along with specific serum metabolomics profiles, are significantly associated with the BMI Z-score of children aged 6-11 years. Higher concentrations of certain metabolites in serum, reflecting exposure to chemical classes or metals, will correlate with variations in BMI Z-score, controlling for age and other relevant covariates. Some metabolites associated with chemical exposures and dietary patterns can serve as biomarkers for the risk of developing obesity.
Research indicates that postnatal exposure to endocrine-disrupting chemicals (EDCs) such as phthalates, bisphenol A (BPA), and polychlorinated biphenyls (PCBs) can significantly influence body weight and metabolic health (Junge et al., 2018). These chemicals, commonly found in household products and absorbed through dietary intake, are linked to detrimental effects on body weight and metabolic health in children. This hormonal interference can lead to an increased body mass index (BMI) in children, suggesting a potential pathway through which exposure to these chemicals contributes to the development of obesity.
A longitudinal study on Japanese children examined the impact of postnatal exposure (first two years of life) to p,p’-dichlorodiphenyltrichloroethane (p,p’-DDT) and p,p’-dichlorodiphenyldichloroethylene (p,p’-DDE) through breastfeeding (Plouffe et al., 2020). The findings revealed that higher levels of these chemicals in breast milk were associated with increased BMI at 42 months of age. DDT and DDE may interfere with hormonal pathways related to growth and development. These chemicals can mimic or disrupt hormones that regulate metabolism and fat accumulation. This study highlights the importance of understanding how persistent organic pollutants can affect early childhood growth and development.
The study by Harley et al. (2013) investigates the association between prenatal and postnatal Bisphenol A (BPA) exposure and various body composition metrics in children aged 9 years from the CHAMACOS cohort. The study found that higher prenatal BPA exposure was linked to a decrease in BMI and body fat percentages in girls but not boys, suggesting sex-specific effects. Conversely, BPA levels measured at age 9 were positively associated with increased adiposity in both genders, highlighting the different impacts of exposure timing on childhood development.
The 2022 study 2022 study by Uldbjerg et al. explored the effects of combined exposures to multiple EDCs, suggesting that mixtures of these chemicals can have additive or synergistic effects on BMI and obesity risk. Humans are typically exposed to a mixture of chemicals rather than individual EDCs, making it crucial to understand how these mixtures might interact. The research highlighted that the interaction between different EDCs can lead to additive (where the effects simply add up) or even synergistic (where the combined effect is greater than the sum of their separate effects) outcomes. These interactions can significantly amplify the risk factors associated with obesity and metabolic disorders in children. The dose-response relationship found that even low-level exposure to multiple EDCs could result in significant health impacts due to their combined effects.
These studies collectively illustrate the critical role of environmental EDCs in shaping metabolic health outcomes in children, highlighting the necessity for ongoing research and policy intervention to mitigate these risks.
This study utilizes data from the subcohort of 1301 mother-child pairs in the HELIX study, who are which aged 6-11 years for whom complete exposure and outcome data were available. Exposure data included detailed dietary records after pregnancy and concentrations of various chemicals like BPA and PCBs in child blood samples. There are categorical and numerical variables, which will include both demographic details and biochemical measurements. This dataset allows for robust statistical analysis to identify potential associations between EDC exposure and changes in BMI Z-scores, considering confounding factors such as age, gender, and socioeconomic status. There are no missing data so there is not need to impute the information. Child BMI Z-scores were calculated based on WHO growth standards.
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/HELIX.RData")
filtered_chem_diet <- codebook %>%
filter(domain %in% c("Chemicals", "Lifestyles") & period == "Postnatal" & subfamily != "Allergens")
# specific covariates
filtered_covariates <- codebook %>%
filter(domain == "Covariates" &
variable_name %in% c("ID", "e3_sex_None", "e3_yearbir_None", "h_cohort", "hs_child_age_None"))
#specific phenotype variables
filtered_phenotype <- codebook %>%
filter(domain == "Phenotype" &
variable_name %in% c("hs_zbmi_who"))
# combining all necessary variables together
combined_codebook <- bind_rows(filtered_chem_diet, filtered_covariates, filtered_phenotype)
kable(combined_codebook, align = "c", format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
| variable_name | domain | family | subfamily | period | location | period_postnatal | description | var_type | transformation | labels | labelsshort | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| h_bfdur_Ter | h_bfdur_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Breastfeeding duration (weeks) | factor | Tertiles | Breastfeeding | Breastfeeding |
| hs_bakery_prod_Ter | hs_bakery_prod_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: bakery products (hs_cookies + hs_pastries) | factor | Tertiles | Bakery prod | BakeProd |
| hs_beverages_Ter | hs_beverages_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: beverages (hs_dietsoda+hs_soda) | factor | Tertiles | Soda | Soda |
| hs_break_cer_Ter | hs_break_cer_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: breakfast cereal (hs_sugarcer+hs_othcer) | factor | Tertiles | BF cereals | BFcereals |
| hs_caff_drink_Ter | hs_caff_drink_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Drinks a caffeinated or æenergy drink (eg coca-cola, diet-coke, redbull) | factor | Tertiles | Caffeine | Caffeine |
| hs_dairy_Ter | hs_dairy_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: dairy (hs_cheese + hs_milk + hs_yogurt+ hs_probiotic+ hs_desert) | factor | Tertiles | Dairy | Dairy |
| hs_fastfood_Ter | hs_fastfood_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Visits a fast food restaurant/take away | factor | Tertiles | Fastfood | Fastfood |
| hs_KIDMED_None | hs_KIDMED_None | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Sum of KIDMED indices, without index9 | numeric | None | KIDMED | KIDMED |
| hs_mvpa_prd_alt_None | hs_mvpa_prd_alt_None | Lifestyles | Lifestyle | Physical activity | Postnatal | NA | NA | Clean & Over-reporting of Moderate-to-Vigorous Physical Activity (min/day) | numeric | None | PA | PA |
| hs_org_food_Ter | hs_org_food_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Eats organic food | factor | Tertiles | Organicfood | Organicfood |
| hs_proc_meat_Ter | hs_proc_meat_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: processed meat (hs_coldmeat+hs_ham) | factor | Tertiles | Processed meat | ProcMeat |
| hs_readymade_Ter | hs_readymade_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Eats a æready-made supermarket meal | factor | Tertiles | Ready made food | ReadyFood |
| hs_sd_wk_None | hs_sd_wk_None | Lifestyles | Lifestyle | Physical activity | Postnatal | NA | NA | sedentary behaviour (min/day) | numeric | None | Sedentary | Sedentary |
| hs_total_bread_Ter | hs_total_bread_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: bread (hs_darkbread+hs_whbread) | factor | Tertiles | Bread | Bread |
| hs_total_cereal_Ter | hs_total_cereal_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: cereal (hs_darkbread + hs_whbread + hs_rice_pasta + hs_sugarcer + hs_othcer + hs_rusks) | factor | Tertiles | Cereals | Cereals |
| hs_total_fish_Ter | hs_total_fish_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: fish and seafood (hs_canfish+hs_oilyfish+hs_whfish+hs_seafood) | factor | Tertiles | Fish | Fish |
| hs_total_fruits_Ter | hs_total_fruits_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: fruits (hs_canfruit+hs_dryfruit+hs_freshjuice+hs_fruits) | factor | Tertiles | Fruits | Fruits |
| hs_total_lipids_Ter | hs_total_lipids_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: Added fat | factor | Tertiles | Diet fat | Diet fat |
| hs_total_meat_Ter | hs_total_meat_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: meat (hs_coldmeat+hs_ham+hs_poultry+hs_redmeat) | factor | Tertiles | Meat | Meat |
| hs_total_potatoes_Ter | hs_total_potatoes_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: potatoes (hs_frenchfries+hs_potatoes) | factor | Tertiles | Potatoes | Potatoes |
| hs_total_sweets_Ter | hs_total_sweets_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: sweets (hs_choco + hs_sweets + hs_sugar) | factor | Tertiles | Sweets | Sweets |
| hs_total_veg_Ter | hs_total_veg_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: vegetables (hs_cookveg+hs_rawveg) | factor | Tertiles | Vegetables | Vegetables |
| hs_total_yog_Ter | hs_total_yog_Ter | Lifestyles | Lifestyle | Diet | Postnatal | NA | NA | Food group: yogurt (hs_yogurt+hs_probiotic) | factor | Tertiles | Yogurt | Yogurt |
| hs_dif_hours_total_None | hs_dif_hours_total_None | Lifestyles | Lifestyle | Sleep | Postnatal | NA | NA | Total hours of sleep (mean weekdays and night) | numeric | None | Sleep | Sleep |
| hs_as_c_Log2 | hs_as_c_Log2 | Chemicals | Metals | As | Postnatal | NA | NA | Arsenic (As) in child | numeric | Logarithm base 2 | As | As |
| hs_cd_c_Log2 | hs_cd_c_Log2 | Chemicals | Metals | Cd | Postnatal | NA | NA | Cadmium (Cd) in child | numeric | Logarithm base 2 | Cd | Cd |
| hs_co_c_Log2 | hs_co_c_Log2 | Chemicals | Metals | Co | Postnatal | NA | NA | Cobalt (Co) in child | numeric | Logarithm base 2 | Co | Co |
| hs_cs_c_Log2 | hs_cs_c_Log2 | Chemicals | Metals | Cs | Postnatal | NA | NA | Caesium (Cs) in child | numeric | Logarithm base 2 | Cs | Cs |
| hs_cu_c_Log2 | hs_cu_c_Log2 | Chemicals | Metals | Cu | Postnatal | NA | NA | Copper (Cu) in child | numeric | Logarithm base 2 | Cu | Cu |
| hs_hg_c_Log2 | hs_hg_c_Log2 | Chemicals | Metals | Hg | Postnatal | NA | NA | Mercury (Hg) in child | numeric | Logarithm base 2 | Hg | Hg |
| hs_mn_c_Log2 | hs_mn_c_Log2 | Chemicals | Metals | Mn | Postnatal | NA | NA | Manganese (Mn) in child | numeric | Logarithm base 2 | Mn | Mn |
| hs_mo_c_Log2 | hs_mo_c_Log2 | Chemicals | Metals | Mo | Postnatal | NA | NA | Molybdenum (Mo) in child | numeric | Logarithm base 2 | Mo | Mo |
| hs_pb_c_Log2 | hs_pb_c_Log2 | Chemicals | Metals | Pb | Postnatal | NA | NA | Lead (Pb) in child | numeric | Logarithm base 2 | Pb | Pb |
| hs_tl_cdich_None | hs_tl_cdich_None | Chemicals | Metals | Tl | Postnatal | NA | NA | Dichotomous variable of thallium (Tl) in child | factor | None | Tl | Tl |
| hs_dde_cadj_Log2 | hs_dde_cadj_Log2 | Chemicals | Organochlorines | DDE | Postnatal | NA | NA | Dichlorodiphenyldichloroethylene (DDE) in child adjusted for lipids | numeric | Logarithm base 2 | DDE | DDE |
| hs_ddt_cadj_Log2 | hs_ddt_cadj_Log2 | Chemicals | Organochlorines | DDT | Postnatal | NA | NA | Dichlorodiphenyltrichloroethane (DDT) in child adjusted for lipids | numeric | Logarithm base 2 | DDT | DDT |
| hs_hcb_cadj_Log2 | hs_hcb_cadj_Log2 | Chemicals | Organochlorines | HCB | Postnatal | NA | NA | Hexachlorobenzene (HCB) in child adjusted for lipids | numeric | Logarithm base 2 | HCB | HCB |
| hs_pcb118_cadj_Log2 | hs_pcb118_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Polychlorinated biphenyl -118 (PCB-118) in child adjusted for lipids | numeric | Logarithm base 2 | PCB 118 | PCB118 |
| hs_pcb138_cadj_Log2 | hs_pcb138_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Polychlorinated biphenyl-138 (PCB-138) in child adjusted for lipids | numeric | Logarithm base 2 | PCB 138 | PCB138 |
| hs_pcb153_cadj_Log2 | hs_pcb153_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Polychlorinated biphenyl-153 (PCB-153) in child adjusted for lipids | numeric | Logarithm base 2 | PCB 153 | PCB153 |
| hs_pcb170_cadj_Log2 | hs_pcb170_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Polychlorinated biphenyl-170 (PCB-170) in child adjusted for lipids | numeric | Logarithm base 2 | PCB 170 | PCB170 |
| hs_pcb180_cadj_Log2 | hs_pcb180_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Polychlorinated biphenyl-180 (PCB-180) in child adjusted for lipids | numeric | Logarithm base 2 | PCB 180 | PCB180 |
| hs_sumPCBs5_cadj_Log2 | hs_sumPCBs5_cadj_Log2 | Chemicals | Organochlorines | PCBs | Postnatal | NA | NA | Sum of PCBs in child adjusted for lipids (4 cohorts) | numeric | Logarithm base 2 | PCBs | SumPCB |
| hs_dep_cadj_Log2 | hs_dep_cadj_Log2 | Chemicals | Organophosphate pesticides | DEP | Postnatal | NA | NA | Diethyl phosphate (DEP) in child adjusted for creatinine | numeric | Logarithm base 2 | DEP | DEP |
| hs_detp_cadj_Log2 | hs_detp_cadj_Log2 | Chemicals | Organophosphate pesticides | DETP | Postnatal | NA | NA | Diethyl thiophosphate (DETP) in child adjusted for creatinine | numeric | Logarithm base 2 | DETP | DETP |
| hs_dmdtp_cdich_None | hs_dmdtp_cdich_None | Chemicals | Organophosphate pesticides | DMDTP | Postnatal | NA | NA | Dichotomous variable of dimethyl dithiophosphate (DMDTP) in child | factor | None | DMDTP | DMDTP |
| hs_dmp_cadj_Log2 | hs_dmp_cadj_Log2 | Chemicals | Organophosphate pesticides | DMP | Postnatal | NA | NA | Dimethyl phosphate (DMP) in child adjusted for creatinine | numeric | Logarithm base 2 | DMP | DMP |
| hs_dmtp_cadj_Log2 | hs_dmtp_cadj_Log2 | Chemicals | Organophosphate pesticides | DMTP | Postnatal | NA | NA | Dimethyl thiophosphate (DMTP) in child adjusted for creatinine | numeric | Logarithm base 2 | DMDTP | DMTP |
| hs_pbde153_cadj_Log2 | hs_pbde153_cadj_Log2 | Chemicals | Polybrominated diphenyl ethers (PBDE) | PBDE153 | Postnatal | NA | NA | Polybrominated diphenyl ether-153 (PBDE-153) in child adjusted for lipids | numeric | Logarithm base 2 | PBDE 153 | PBDE153 |
| hs_pbde47_cadj_Log2 | hs_pbde47_cadj_Log2 | Chemicals | Polybrominated diphenyl ethers (PBDE) | PBDE47 | Postnatal | NA | NA | Polybrominated diphenyl ether-47 (PBDE-47) in child adjusted for lipids | numeric | Logarithm base 2 | PBDE 47 | PBDE47 |
| hs_pfhxs_c_Log2 | hs_pfhxs_c_Log2 | Chemicals | Per- and polyfluoroalkyl substances (PFAS) | PFHXS | Postnatal | NA | NA | Perfluorohexane sulfonate (PFHXS) in child | numeric | Logarithm base 2 | PFHXS | PFHXS |
| hs_pfna_c_Log2 | hs_pfna_c_Log2 | Chemicals | Per- and polyfluoroalkyl substances (PFAS) | PFNA | Postnatal | NA | NA | Perfluorononanoate (PFNA) in child | numeric | Logarithm base 2 | PFNA | PFNA |
| hs_pfoa_c_Log2 | hs_pfoa_c_Log2 | Chemicals | Per- and polyfluoroalkyl substances (PFAS) | PFOA | Postnatal | NA | NA | Perfluorooctanoate (PFOA) in child | numeric | Logarithm base 2 | PFOA | PFOA |
| hs_pfos_c_Log2 | hs_pfos_c_Log2 | Chemicals | Per- and polyfluoroalkyl substances (PFAS) | PFOS | Postnatal | NA | NA | Perfluorooctane sulfonate (PFOS) in child | numeric | Logarithm base 2 | PFOS | PFOS |
| hs_pfunda_c_Log2 | hs_pfunda_c_Log2 | Chemicals | Per- and polyfluoroalkyl substances (PFAS) | PFUNDA | Postnatal | NA | NA | Perfluoroundecanoate (PFUNDA) in child | numeric | Logarithm base 2 | PFUNDA | PFUNDA |
| hs_bpa_cadj_Log2 | hs_bpa_cadj_Log2 | Chemicals | Phenols | BPA | Postnatal | NA | NA | Bisphenol A (BPA) in child adjusted for creatinine | numeric | Logarithm base 2 | BPA | BPA |
| hs_bupa_cadj_Log2 | hs_bupa_cadj_Log2 | Chemicals | Phenols | BUPA | Postnatal | NA | NA | N-Butyl paraben (BUPA) in child adjusted for creatinine | numeric | Logarithm base 2 | BUPA | BUPA |
| hs_etpa_cadj_Log2 | hs_etpa_cadj_Log2 | Chemicals | Phenols | ETPA | Postnatal | NA | NA | Ethyl paraben (ETPA) in child adjusted for creatinine | numeric | Logarithm base 2 | ETPA | ETPA |
| hs_mepa_cadj_Log2 | hs_mepa_cadj_Log2 | Chemicals | Phenols | MEPA | Postnatal | NA | NA | Methyl paraben (MEPA) in child adjusted for creatinine | numeric | Logarithm base 2 | MEPA | MEPA |
| hs_oxbe_cadj_Log2 | hs_oxbe_cadj_Log2 | Chemicals | Phenols | OXBE | Postnatal | NA | NA | Oxybenzone (OXBE) in child adjusted for creatinine | numeric | Logarithm base 2 | OXBE | OXBE |
| hs_prpa_cadj_Log2 | hs_prpa_cadj_Log2 | Chemicals | Phenols | PRPA | Postnatal | NA | NA | Propyl paraben (PRPA) in child adjusted for creatinine | numeric | Logarithm base 2 | PRPA | PRPA |
| hs_trcs_cadj_Log2 | hs_trcs_cadj_Log2 | Chemicals | Phenols | TRCS | Postnatal | NA | NA | Triclosan (TRCS) in child adjusted for creatinine | numeric | Logarithm base 2 | TRCS | TRCS |
| hs_mbzp_cadj_Log2 | hs_mbzp_cadj_Log2 | Chemicals | Phthalates | MBZP | Postnatal | NA | NA | Mono benzyl phthalate (MBzP) in child adjusted for creatinine | numeric | Logarithm base 2 | MBZP | MBZP |
| hs_mecpp_cadj_Log2 | hs_mecpp_cadj_Log2 | Chemicals | Phthalates | MECPP | Postnatal | NA | NA | Mono-2-ethyl 5-carboxypentyl phthalate (MECPP) in child adjusted for creatinine | numeric | Logarithm base 2 | MECPP | MECPP |
| hs_mehhp_cadj_Log2 | hs_mehhp_cadj_Log2 | Chemicals | Phthalates | MEHHP | Postnatal | NA | NA | Mono-2-ethyl-5-hydroxyhexyl phthalate (MEHHP) in child adjusted for creatinine | numeric | Logarithm base 2 | MEHHP | MEHHP |
| hs_mehp_cadj_Log2 | hs_mehp_cadj_Log2 | Chemicals | Phthalates | MEHP | Postnatal | NA | NA | Mono-2-ethylhexyl phthalate (MEHP) in child adjusted for creatinine | numeric | Logarithm base 2 | MEHP | MEHP |
| hs_meohp_cadj_Log2 | hs_meohp_cadj_Log2 | Chemicals | Phthalates | MEOHP | Postnatal | NA | NA | Mono-2-ethyl-5-oxohexyl phthalate (MEOHP) in child adjusted for creatinine | numeric | Logarithm base 2 | MEOHP | MEOHP |
| hs_mep_cadj_Log2 | hs_mep_cadj_Log2 | Chemicals | Phthalates | MEP | Postnatal | NA | NA | Monoethyl phthalate (MEP) in child adjusted for creatinine | numeric | Logarithm base 2 | MEP | MEP |
| hs_mibp_cadj_Log2 | hs_mibp_cadj_Log2 | Chemicals | Phthalates | MIBP | Postnatal | NA | NA | Mono-iso-butyl phthalate (MiBP) in child adjusted for creatinine | numeric | Logarithm base 2 | MIBP | MIBP |
| hs_mnbp_cadj_Log2 | hs_mnbp_cadj_Log2 | Chemicals | Phthalates | MNBP | Postnatal | NA | NA | Mono-n-butyl phthalate (MnBP) in child adjusted for creatinine | numeric | Logarithm base 2 | MNBP | MNBP |
| hs_ohminp_cadj_Log2 | hs_ohminp_cadj_Log2 | Chemicals | Phthalates | OHMiNP | Postnatal | NA | NA | Mono-4-methyl-7-hydroxyoctyl phthalate (OHMiNP) in child adjusted for creatinine | numeric | Logarithm base 2 | OHMiNP | OHMiNP |
| hs_oxominp_cadj_Log2 | hs_oxominp_cadj_Log2 | Chemicals | Phthalates | OXOMINP | Postnatal | NA | NA | Mono-4-methyl-7-oxooctyl phthalate (OXOMiNP) in child adjusted for creatinine | numeric | Logarithm base 2 | OXOMINP | OXOMINP |
| hs_sumDEHP_cadj_Log2 | hs_sumDEHP_cadj_Log2 | Chemicals | Phthalates | DEHP | Postnatal | NA | NA | Sum of DEHP metabolites (µg/g) in child adjusted for creatinine | numeric | Logarithm base 2 | DEHP | SumDEHP |
| FAS_cat_None | FAS_cat_None | Chemicals | Social and economic capital | Economic capital | Postnatal | NA | NA | Family affluence score | factor | None | Family affluence | FamAfl |
| hs_contactfam_3cat_num_None | hs_contactfam_3cat_num_None | Chemicals | Social and economic capital | Social capital | Postnatal | NA | NA | scoial capital: family friends | factor | None | Social contact | SocCont |
| hs_hm_pers_None | hs_hm_pers_None | Chemicals | Social and economic capital | Social capital | Postnatal | NA | NA | How many people live in your home? | numeric | None | House crowding | HouseCrow |
| hs_participation_3cat_None | hs_participation_3cat_None | Chemicals | Social and economic capital | Social capital | Postnatal | NA | NA | social capital: structural | factor | None | Social participation | SocPartic |
| hs_cotinine_cdich_None | hs_cotinine_cdich_None | Chemicals | Tobacco Smoke | Cotinine | Postnatal | NA | NA | Dichotomous variable of cotinine in child | factor | None | Cotinine | Cotinine |
| hs_globalexp2_None | hs_globalexp2_None | Chemicals | Tobacco Smoke | Tobacco Smoke | Postnatal | NA | NA | Global exposure of the child to ETS (2 categories) | factor | None | ETS | ETS |
| hs_smk_parents_None | hs_smk_parents_None | Chemicals | Tobacco Smoke | Tobacco Smoke | Postnatal | NA | NA | Tobacco Smoke status of parents (both) | factor | None | Smoking_parents | SmokPar |
| e3_sex_None | e3_sex_None | Covariates | Covariates | Child covariate | Pregnancy | NA | NA | Child sex (female / male) | factor | None | Child sex | Sex |
| e3_yearbir_None | e3_yearbir_None | Covariates | Covariates | Child covariate | Pregnancy | NA | NA | Year of birth (2003 to 2009) | factor | None | Year of birth | YearBirth |
| h_cohort | h_cohort | Covariates | Covariates | Maternal covariate | Pregnancy | NA | NA | Cohort of inclusion (1 to 6) | factor | None | Cohort | Cohort |
| hs_child_age_None | hs_child_age_None | Covariates | Covariates | Child covariate | Postnatal | NA | NA | Child age at examination (years) | numeric | None | Child age | cAge |
| hs_zbmi_who | hs_zbmi_who | Phenotype | Phenotype | Outcome at 6-11 years old | Postnatal | NA | NA | Body mass index z-score at 6-11 years old - WHO reference - Standardized on sex and age | numeric | None | Body mass index z-score | zBMI |
# specific lifestyle exposures
lifestyle_exposures <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
lifestyle_exposome <- dplyr::select(exposome, all_of(lifestyle_exposures))
summarytools::view(dfSummary(lifestyle_exposome, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | h_bfdur_Ter [factor] |
|
|
0 (0.0%) | ||||||||||||||||
| 2 | hs_bakery_prod_Ter [factor] |
|
|
0 (0.0%) | ||||||||||||||||
| 3 | hs_break_cer_Ter [factor] |
|
|
0 (0.0%) | ||||||||||||||||
| 4 | hs_dairy_Ter [factor] |
|
|
0 (0.0%) | ||||||||||||||||
| 5 | hs_fastfood_Ter [factor] |
|
|
0 (0.0%) | ||||||||||||||||
| 6 | hs_org_food_Ter [factor] |
|
|
0 (0.0%) | ||||||||||||||||
| 7 | hs_proc_meat_Ter [factor] |
|
|
0 (0.0%) | ||||||||||||||||
| 8 | hs_total_fish_Ter [factor] |
|
|
0 (0.0%) | ||||||||||||||||
| 9 | hs_total_fruits_Ter [factor] |
|
|
0 (0.0%) | ||||||||||||||||
| 10 | hs_total_lipids_Ter [factor] |
|
|
0 (0.0%) | ||||||||||||||||
| 11 | hs_total_sweets_Ter [factor] |
|
|
0 (0.0%) | ||||||||||||||||
| 12 | hs_total_veg_Ter [factor] |
|
|
0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.4.0)
2024-07-21
categorical_lifestyle <- lifestyle_exposome %>%
dplyr::select(where(is.factor))
categorical_lifestyle_long <- pivot_longer(
categorical_lifestyle,
cols = everything(),
names_to = "variable",
values_to = "value"
)
unique_categorical_vars <- unique(categorical_lifestyle_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
data <- filter(categorical_lifestyle_long, variable == var)
p <- ggplot(data, aes(x = value, fill = value)) +
geom_bar(stat = "count") +
labs(title = paste("Distribution of", var), x = var, y = "Count")
print(p)
return(p)
})
Breastfeeding Duration: Majority of observations are in the highest duration category, suggesting longer breastfeeding periods are common.
Bakery Products: Shows a relatively even distribution across the three categories, indicating varied consumption levels of bakery products among participants.
Breakfast Cereal: The highest category of cereal consumption is the most common, suggesting a preference for or greater consumption of cereals.
Dairy: Shows a fairly even distribution across all categories, indicating a uniform consumption pattern of dairy products.
Fast Food: Most participants fall into the middle category, indicating moderate consumption of fast food.
Organic Food: Most participants either consume a lot of or no organic food, with fewer in the middle range.
Processed Meat: Consumption levels are fairly evenly distributed, indicating varied dietary habits regarding processed meats.
Bread: Distribution shows a significant leaning towards higher bread consumption.
Cereal: Even distribution across categories suggests varied cereal consumption habits.
Fish and Seafood: Even distribution across categories, indicating varied consumption of fish and seafood.
Fruits: High fruit consumption is the most common, with fewer participants in the lowest category.
Added Fats: More participants consume added fats at the lowest and highest levels, with fewer in the middle.
Sweets: High consumption of sweets is the most common, indicating a preference for or higher access to sugary foods.
Vegetables: Most participants consume a high amount of vegetables.
# specific chemical exposures
chemical_exposures <- c(
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
chemical_exposome <- dplyr::select(exposome, all_of(chemical_exposures))
summarytools::view(dfSummary(chemical_exposome, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | ||||
|---|---|---|---|---|---|---|---|---|---|
| 1 | hs_cd_c_Log2 [numeric] |
|
695 distinct values | 0 (0.0%) | |||||
| 2 | hs_co_c_Log2 [numeric] |
|
317 distinct values | 0 (0.0%) | |||||
| 3 | hs_cs_c_Log2 [numeric] |
|
369 distinct values | 0 (0.0%) | |||||
| 4 | hs_cu_c_Log2 [numeric] |
|
345 distinct values | 0 (0.0%) | |||||
| 5 | hs_hg_c_Log2 [numeric] |
|
698 distinct values | 0 (0.0%) | |||||
| 6 | hs_mo_c_Log2 [numeric] |
|
593 distinct values | 0 (0.0%) | |||||
| 7 | hs_pb_c_Log2 [numeric] |
|
529 distinct values | 0 (0.0%) | |||||
| 8 | hs_dde_cadj_Log2 [numeric] |
|
1050 distinct values | 0 (0.0%) | |||||
| 9 | hs_pcb153_cadj_Log2 [numeric] |
|
1047 distinct values | 0 (0.0%) | |||||
| 10 | hs_pcb170_cadj_Log2 [numeric] |
|
1039 distinct values | 0 (0.0%) | |||||
| 11 | hs_dep_cadj_Log2 [numeric] |
|
1045 distinct values | 0 (0.0%) | |||||
| 12 | hs_pbde153_cadj_Log2 [numeric] |
|
1036 distinct values | 0 (0.0%) | |||||
| 13 | hs_pfhxs_c_Log2 [numeric] |
|
1061 distinct values | 0 (0.0%) | |||||
| 14 | hs_pfoa_c_Log2 [numeric] |
|
1061 distinct values | 0 (0.0%) | |||||
| 15 | hs_pfos_c_Log2 [numeric] |
|
1050 distinct values | 0 (0.0%) | |||||
| 16 | hs_prpa_cadj_Log2 [numeric] |
|
1031 distinct values | 0 (0.0%) | |||||
| 17 | hs_mbzp_cadj_Log2 [numeric] |
|
1046 distinct values | 0 (0.0%) | |||||
| 18 | hs_mibp_cadj_Log2 [numeric] |
|
1057 distinct values | 0 (0.0%) | |||||
| 19 | hs_mnbp_cadj_Log2 [numeric] |
|
1048 distinct values | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.4.0)
2024-07-21
#separate numeric and categorical data
numeric_chemical <- chemical_exposome %>%
dplyr::select(where(is.numeric))
numeric_chemical_long <- pivot_longer(
numeric_chemical,
cols = everything(),
names_to = "variable",
values_to = "value"
)
unique_numerical_vars <- unique(numeric_chemical_long$variable)
num_plots <- lapply(unique_numerical_vars, function(var) {
data <- filter(numeric_chemical_long, variable == var)
p <- ggplot(data, aes(x = value)) +
geom_histogram(bins = 30, fill = "blue") +
labs(title = paste("Histogram of", var), x = "Value", y = "Count")
print(p)
return(p)
})
Cadmium (hs_cd_c_Log2): The histogram for Cadmium exposure (hs_cd_c_Log2) shows a relatively symmetric distribution centered around 4 on the log2 scale. Most values range from approximately 3 to 5, with few outliers.
Cobalt (hs_co_c_Log2): The histogram of cobalt levels displays a roughly normal distribution centered around a slight positive skew, peaking around 3.5.
Cesium (hs_cs_c_Log2): Exhibits a right-skewed distribution, indicating that most participants have relatively low exposure levels, but a small number have substantially higher exposures. Majority of the data centered around 1.5 to 2.5
Copper (hs_cu_c_Log2): Shows a right-skewed distribution, suggesting that while most individuals have moderate exposure, a few experience significantly higher levels of copper.
Mercury (hs_hg_c_Log2): This distribution is also right-skewed, common for environmental pollutants, where a majority have lower exposure levels, and a minority have high exposure levels.
Molybdenum (hs_mo_c_Log2): Shows a distribution with a sharp peak and a long right tail, suggesting that while most people have similar exposure levels, a few have exceptionally high exposures. Has a sharp peak around 6, indicating that most values fall within a narrow range.
Lead (hs_pb_c_Log2): The distribution is slightly right-skewed, indicating higher exposure levels in a smaller group of the population compared to the majority.
DDE (hs_dde_cadj_Log2): Shows a pronounced right skew, typical for chemicals that accumulate in the environment and in human tissues, indicating higher levels of exposure in a smaller subset of the population..
PCB 153 (hs_pcb153_cadj_Log2): Has a distribution with right skewness, suggesting that exposure to these compounds is higher among a smaller segment of the population. Bimodal, indicating two peaks around 2 and 4.
PCB 170 (hs_pcb170_cadj_Log2): This histograms show a significant right skew, indicating lower concentrations of these chemicals in most samples, with fewer samples showing higher concentrations. This pattern suggests that while most individuals have low exposure, a few may have considerably higher levels.
DEP (hs_dep_cadj_Log2): DEP exposure has a sharp peak around 6, indicating a narrow distribution of values.
PBDE 153 (hs_pbde153_cadj_Log2): This histogram shows a bimodal distribution, with peaks around 1 and 4.
PFHxS: Distribution peaks around 5 with a broad spread, indicating variation in PFHxS levels.
PFOA: Shows a peak around 6 with a symmetrical distribution, indicating consistent PFOA levels.
PFOS: Similar to PFOA, the distribution peaks around 6, indicating consistent PFOS levels.
Propyl Paraben (hs_prpa_cadj_Log2): The distribution peaks around 6 with a broad spread, indicating variability in propyl paraben levels.
Monobenzyl Phthalate (hs_mbzp_cadj_Log2): This histogram shows a right-skewed distribution. Most values cluster at the lower end, indicating a common lower exposure level among subjects, with a long tail towards higher values suggesting occasional higher exposures. Shows a broad distribution with a peak around 4, indicating variation in monobenzyl phthalate levels. Indicates consistent but varied exposure levels.
Monoisobutyl Phthalate (hs_mibp_cadj_Log2): The distribution is right-skewed, similar to MBZP, but with a smoother decline. This pattern also indicates that while most subjects have lower exposure levels, a few experience significantly higher exposures.
Mono-n-butyl Phthalate (hs_mnbp_cadj_Log2): Peaks around 4, indicating consistent exposure levels with some variation. Few outliers are present.
numeric_chemical <- select_if(chemical_exposome, is.numeric)
cor_matrix <- cor(numeric_chemical, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 90, tl.cex = 0.6)
# Specified covariates
specific_covariates <- c(
"e3_sex_None",
"e3_yearbir_None",
"h_cohort",
"hs_child_age_None"
)
covariate_data <- dplyr::select(covariates, all_of(specific_covariates))
summarytools::view(dfSummary(covariate_data, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | e3_sex_None [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||
| 2 | e3_yearbir_None [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||
| 3 | h_cohort [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||
| 4 | hs_child_age_None [numeric] |
|
879 distinct values | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.4.0)
2024-07-21
#separate numeric and categorical data
numeric_covariates <- covariate_data %>%
dplyr::select(where(is.numeric))
numeric_covariates_long <- pivot_longer(
numeric_covariates,
cols = everything(),
names_to = "variable",
values_to = "value"
)
unique_numerical_vars <- unique(numeric_covariates_long$variable)
num_plots <- lapply(unique_numerical_vars, function(var) {
data <- filter(numeric_covariates_long, variable == var)
p <- ggplot(data, aes(x = value)) +
geom_histogram(bins = 30, fill = "blue") +
labs(title = paste("Histogram of", var), x = "Value", y = "Count")
print(p)
return(p)
})
Child’s Age (hs_child_age): This histogram is multimodal, reflecting several peaks across different ages. This could be indicative of the data collection points or particular age groups being studied.
categorical_covariates <- covariate_data %>%
dplyr::select(where(is.factor))
categorical_covariates_long <- pivot_longer(
categorical_covariates,
cols = everything(),
names_to = "variable",
values_to = "value"
)
unique_categorical_vars <- unique(categorical_covariates_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
data <- filter(categorical_covariates_long, variable == var)
p <- ggplot(data, aes(x = value, fill = value)) +
geom_bar(stat = "count") +
labs(title = paste("Distribution of", var), x = var, y = "Count")
print(p)
return(p)
})
Cohorts (h_cohort): The distribution shows the count of subjects across six different cohorts. All cohorts have a substantial number of subjects, with cohort 5 showing the highest participation.
Gender Distribution (e3_sex): The gender distribution is nearly balanced with a slight higher count for males compared to females.
Year of Birth (e3_yearbir): This chart shows that the majority of subjects were born in the later years, with a significant increase in 2009, indicating perhaps a larger recruitment or a specific cohort focus that year.
outcome_BMI <- phenotype %>%
dplyr::select(hs_zbmi_who)
summarytools::view(dfSummary(outcome_BMI, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | ||||
|---|---|---|---|---|---|---|---|---|---|
| 1 | hs_zbmi_who [numeric] |
|
421 distinct values | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.4.0)
2024-07-21
# Combine all selected data
combined_data <- cbind(covariate_data, lifestyle_exposome, chemical_exposome, outcome_BMI)
# Ensure no duplicated columns
combined_data <- combined_data[, !duplicated(colnames(combined_data))]
# Convert sex variable to a factor for stratification
combined_data$e3_sex_None <- as.factor(combined_data$e3_sex_None)
levels(combined_data$e3_sex_None) <- c("Male", "Female")
render_cont <- function(x) {
with(stats.default(x), sprintf("%0.2f (%0.2f)", MEAN, SD))
}
render_cat <- function(x) {
c("", sapply(stats.default(x), function(y) with(y, sprintf("%d (%0.1f %%)", FREQ, PCT))))
}
# Define the formula for table1
table1_formula <- ~
hs_child_age_None + e3_yearbir_None + h_cohort +
hs_zbmi_who +
h_bfdur_Ter + hs_bakery_prod_Ter + hs_break_cer_Ter + hs_dairy_Ter + hs_fastfood_Ter + hs_org_food_Ter +
hs_proc_meat_Ter +
hs_total_fish_Ter + hs_total_fruits_Ter + hs_total_lipids_Ter + hs_total_sweets_Ter + hs_total_veg_Ter +
hs_cd_c_Log2 + hs_co_c_Log2 + hs_cs_c_Log2 + hs_cu_c_Log2 +
hs_hg_c_Log2 + hs_mo_c_Log2 + hs_dde_cadj_Log2 + hs_pcb153_cadj_Log2 +
hs_pcb170_cadj_Log2 + hs_dep_cadj_Log2 + hs_pbde153_cadj_Log2 +
hs_pfhxs_c_Log2 + hs_pfoa_c_Log2 + hs_pfos_c_Log2 + hs_prpa_cadj_Log2 +
hs_mbzp_cadj_Log2 + hs_mibp_cadj_Log2 + hs_mnbp_cadj_Log2 | e3_sex_None
# Create the table
table1(
table1_formula,
data = combined_data,
render.continuous = render_cont,
render.categorical = render_cat,
overall = TRUE,
topclass = "Rtable1-shade"
)
| Male (N=608) |
Female (N=693) |
TRUE (N=1301) |
|
|---|---|---|---|
| hs_child_age_None | 7.91 (1.58) | 8.03 (1.64) | 7.98 (1.61) |
| e3_yearbir_None | |||
| 2003 | 25 (4.1 %) | 30 (4.3 %) | 55 (4.2 %) |
| 2004 | 46 (7.6 %) | 61 (8.8 %) | 107 (8.2 %) |
| 2005 | 121 (19.9 %) | 120 (17.3 %) | 241 (18.5 %) |
| 2006 | 108 (17.8 %) | 148 (21.4 %) | 256 (19.7 %) |
| 2007 | 128 (21.1 %) | 122 (17.6 %) | 250 (19.2 %) |
| 2008 | 177 (29.1 %) | 202 (29.1 %) | 379 (29.1 %) |
| 2009 | 3 (0.5 %) | 10 (1.4 %) | 13 (1.0 %) |
| h_cohort | |||
| 1 | 97 (16.0 %) | 105 (15.2 %) | 202 (15.5 %) |
| 2 | 86 (14.1 %) | 112 (16.2 %) | 198 (15.2 %) |
| 3 | 102 (16.8 %) | 122 (17.6 %) | 224 (17.2 %) |
| 4 | 93 (15.3 %) | 114 (16.5 %) | 207 (15.9 %) |
| 5 | 129 (21.2 %) | 143 (20.6 %) | 272 (20.9 %) |
| 6 | 101 (16.6 %) | 97 (14.0 %) | 198 (15.2 %) |
| hs_zbmi_who | 0.35 (1.15) | 0.45 (1.22) | 0.40 (1.19) |
| h_bfdur_Ter | |||
| (0,10.8] | 231 (38.0 %) | 275 (39.7 %) | 506 (38.9 %) |
| (10.8,34.9] | 118 (19.4 %) | 152 (21.9 %) | 270 (20.8 %) |
| (34.9,Inf] | 259 (42.6 %) | 266 (38.4 %) | 525 (40.4 %) |
| hs_bakery_prod_Ter | |||
| (0,2] | 164 (27.0 %) | 181 (26.1 %) | 345 (26.5 %) |
| (2,6] | 188 (30.9 %) | 235 (33.9 %) | 423 (32.5 %) |
| (6,Inf] | 256 (42.1 %) | 277 (40.0 %) | 533 (41.0 %) |
| hs_break_cer_Ter | |||
| (0,1.1] | 141 (23.2 %) | 150 (21.6 %) | 291 (22.4 %) |
| (1.1,5.5] | 251 (41.3 %) | 270 (39.0 %) | 521 (40.0 %) |
| (5.5,Inf] | 216 (35.5 %) | 273 (39.4 %) | 489 (37.6 %) |
| hs_dairy_Ter | |||
| (0,14.6] | 175 (28.8 %) | 184 (26.6 %) | 359 (27.6 %) |
| (14.6,25.6] | 229 (37.7 %) | 236 (34.1 %) | 465 (35.7 %) |
| (25.6,Inf] | 204 (33.6 %) | 273 (39.4 %) | 477 (36.7 %) |
| hs_fastfood_Ter | |||
| (0,0.132] | 75 (12.3 %) | 68 (9.8 %) | 143 (11.0 %) |
| (0.132,0.5] | 273 (44.9 %) | 330 (47.6 %) | 603 (46.3 %) |
| (0.5,Inf] | 260 (42.8 %) | 295 (42.6 %) | 555 (42.7 %) |
| hs_org_food_Ter | |||
| (0,0.132] | 211 (34.7 %) | 218 (31.5 %) | 429 (33.0 %) |
| (0.132,1] | 191 (31.4 %) | 205 (29.6 %) | 396 (30.4 %) |
| (1,Inf] | 206 (33.9 %) | 270 (39.0 %) | 476 (36.6 %) |
| hs_proc_meat_Ter | |||
| (0,1.5] | 175 (28.8 %) | 191 (27.6 %) | 366 (28.1 %) |
| (1.5,4] | 227 (37.3 %) | 244 (35.2 %) | 471 (36.2 %) |
| (4,Inf] | 206 (33.9 %) | 258 (37.2 %) | 464 (35.7 %) |
| hs_total_fish_Ter | |||
| (0,1.5] | 183 (30.1 %) | 206 (29.7 %) | 389 (29.9 %) |
| (1.5,3] | 224 (36.8 %) | 230 (33.2 %) | 454 (34.9 %) |
| (3,Inf] | 201 (33.1 %) | 257 (37.1 %) | 458 (35.2 %) |
| hs_total_fruits_Ter | |||
| (0,7] | 174 (28.6 %) | 239 (34.5 %) | 413 (31.7 %) |
| (7,14.1] | 216 (35.5 %) | 191 (27.6 %) | 407 (31.3 %) |
| (14.1,Inf] | 218 (35.9 %) | 263 (38.0 %) | 481 (37.0 %) |
| hs_total_lipids_Ter | |||
| (0,3] | 193 (31.7 %) | 204 (29.4 %) | 397 (30.5 %) |
| (3,7] | 171 (28.1 %) | 232 (33.5 %) | 403 (31.0 %) |
| (7,Inf] | 244 (40.1 %) | 257 (37.1 %) | 501 (38.5 %) |
| hs_total_sweets_Ter | |||
| (0,4.1] | 149 (24.5 %) | 195 (28.1 %) | 344 (26.4 %) |
| (4.1,8.5] | 251 (41.3 %) | 265 (38.2 %) | 516 (39.7 %) |
| (8.5,Inf] | 208 (34.2 %) | 233 (33.6 %) | 441 (33.9 %) |
| hs_total_veg_Ter | |||
| (0,6] | 190 (31.2 %) | 214 (30.9 %) | 404 (31.1 %) |
| (6,8.5] | 136 (22.4 %) | 178 (25.7 %) | 314 (24.1 %) |
| (8.5,Inf] | 282 (46.4 %) | 301 (43.4 %) | 583 (44.8 %) |
| hs_cd_c_Log2 | -3.99 (0.98) | -3.95 (1.09) | -3.97 (1.04) |
| hs_co_c_Log2 | -2.37 (0.61) | -2.32 (0.64) | -2.34 (0.63) |
| hs_cs_c_Log2 | 0.44 (0.58) | 0.44 (0.57) | 0.44 (0.57) |
| hs_cu_c_Log2 | 9.81 (0.25) | 9.84 (0.22) | 9.83 (0.23) |
| hs_hg_c_Log2 | -0.24 (1.59) | -0.35 (1.75) | -0.30 (1.68) |
| hs_mo_c_Log2 | -0.32 (0.83) | -0.31 (0.96) | -0.32 (0.90) |
| hs_dde_cadj_Log2 | 4.63 (1.48) | 4.70 (1.50) | 4.67 (1.49) |
| hs_pcb153_cadj_Log2 | 3.47 (0.86) | 3.63 (0.94) | 3.56 (0.90) |
| hs_pcb170_cadj_Log2 | -0.60 (3.22) | -0.05 (2.77) | -0.31 (3.00) |
| hs_dep_cadj_Log2 | 0.27 (3.16) | 0.06 (3.25) | 0.16 (3.21) |
| hs_pbde153_cadj_Log2 | -4.66 (3.86) | -4.40 (3.80) | -4.53 (3.83) |
| hs_pfhxs_c_Log2 | -1.62 (1.30) | -1.53 (1.31) | -1.57 (1.31) |
| hs_pfoa_c_Log2 | 0.60 (0.55) | 0.62 (0.56) | 0.61 (0.55) |
| hs_pfos_c_Log2 | 0.95 (1.15) | 0.99 (1.08) | 0.97 (1.11) |
| hs_prpa_cadj_Log2 | -1.26 (3.96) | -1.91 (3.68) | -1.61 (3.82) |
| hs_mbzp_cadj_Log2 | 2.42 (1.23) | 2.47 (1.22) | 2.44 (1.22) |
| hs_mibp_cadj_Log2 | 5.54 (1.09) | 5.39 (1.12) | 5.46 (1.11) |
| hs_mnbp_cadj_Log2 | 4.77 (1.08) | 4.60 (0.96) | 4.68 (1.02) |
combined_data$h_cohort <- as.factor(combined_data$h_cohort)
# Create the table
table1(
~ hs_child_age_None + e3_sex_None + e3_yearbir_None +
hs_zbmi_who + h_bfdur_Ter + hs_bakery_prod_Ter +
hs_break_cer_Ter + hs_dairy_Ter + hs_fastfood_Ter +
hs_org_food_Ter + hs_proc_meat_Ter + hs_total_fish_Ter + hs_total_fruits_Ter +
hs_total_lipids_Ter +
hs_total_sweets_Ter + hs_total_veg_Ter +
hs_cd_c_Log2 + hs_co_c_Log2 + hs_cs_c_Log2 + hs_cu_c_Log2 +
hs_hg_c_Log2 + hs_mo_c_Log2 + hs_dde_cadj_Log2 + hs_pcb153_cadj_Log2 +
hs_pcb170_cadj_Log2 + hs_dep_cadj_Log2 + hs_pbde153_cadj_Log2 +
hs_pfhxs_c_Log2 + hs_pfoa_c_Log2 + hs_pfos_c_Log2 + hs_prpa_cadj_Log2 +
hs_mbzp_cadj_Log2 + hs_mibp_cadj_Log2 + hs_mnbp_cadj_Log2 | h_cohort,
data = combined_data,
render.continuous = render_cont,
render.categorical = render_cat,
overall = TRUE,
topclass = "Rtable1-shade"
)
| 1 (N=202) |
2 (N=198) |
3 (N=224) |
4 (N=207) |
5 (N=272) |
6 (N=198) |
TRUE (N=1301) |
|
|---|---|---|---|---|---|---|---|
| hs_child_age_None | 6.61 (0.28) | 10.82 (0.58) | 8.78 (0.58) | 6.48 (0.47) | 8.46 (0.53) | 6.51 (0.30) | 7.98 (1.61) |
| e3_sex_None | |||||||
| Male | 97 (48.0 %) | 86 (43.4 %) | 102 (45.5 %) | 93 (44.9 %) | 129 (47.4 %) | 101 (51.0 %) | 608 (46.7 %) |
| Female | 105 (52.0 %) | 112 (56.6 %) | 122 (54.5 %) | 114 (55.1 %) | 143 (52.6 %) | 97 (49.0 %) | 693 (53.3 %) |
| e3_yearbir_None | |||||||
| 2003 | 0 (0.0 %) | 55 (27.8 %) | 0 (0.0 %) | 0 (0.0 %) | 0 (0.0 %) | 0 (0.0 %) | 55 (4.2 %) |
| 2004 | 0 (0.0 %) | 107 (54.0 %) | 0 (0.0 %) | 0 (0.0 %) | 0 (0.0 %) | 0 (0.0 %) | 107 (8.2 %) |
| 2005 | 0 (0.0 %) | 36 (18.2 %) | 120 (53.6 %) | 0 (0.0 %) | 85 (31.2 %) | 0 (0.0 %) | 241 (18.5 %) |
| 2006 | 0 (0.0 %) | 0 (0.0 %) | 99 (44.2 %) | 0 (0.0 %) | 157 (57.7 %) | 0 (0.0 %) | 256 (19.7 %) |
| 2007 | 82 (40.6 %) | 0 (0.0 %) | 5 (2.2 %) | 62 (30.0 %) | 30 (11.0 %) | 71 (35.9 %) | 250 (19.2 %) |
| 2008 | 117 (57.9 %) | 0 (0.0 %) | 0 (0.0 %) | 136 (65.7 %) | 0 (0.0 %) | 126 (63.6 %) | 379 (29.1 %) |
| 2009 | 3 (1.5 %) | 0 (0.0 %) | 0 (0.0 %) | 9 (4.3 %) | 0 (0.0 %) | 1 (0.5 %) | 13 (1.0 %) |
| hs_zbmi_who | 0.20 (1.15) | 0.19 (1.13) | 0.80 (1.22) | 0.52 (1.22) | 0.09 (0.90) | 0.68 (1.37) | 0.40 (1.19) |
| h_bfdur_Ter | |||||||
| (0,10.8] | 74 (36.6 %) | 119 (60.1 %) | 70 (31.2 %) | 58 (28.0 %) | 101 (37.1 %) | 84 (42.4 %) | 506 (38.9 %) |
| (10.8,34.9] | 2 (1.0 %) | 57 (28.8 %) | 100 (44.6 %) | 30 (14.5 %) | 0 (0.0 %) | 81 (40.9 %) | 270 (20.8 %) |
| (34.9,Inf] | 126 (62.4 %) | 22 (11.1 %) | 54 (24.1 %) | 119 (57.5 %) | 171 (62.9 %) | 33 (16.7 %) | 525 (40.4 %) |
| hs_bakery_prod_Ter | |||||||
| (0,2] | 29 (14.4 %) | 41 (20.7 %) | 39 (17.4 %) | 34 (16.4 %) | 187 (68.8 %) | 15 (7.6 %) | 345 (26.5 %) |
| (2,6] | 66 (32.7 %) | 51 (25.8 %) | 89 (39.7 %) | 84 (40.6 %) | 74 (27.2 %) | 59 (29.8 %) | 423 (32.5 %) |
| (6,Inf] | 107 (53.0 %) | 106 (53.5 %) | 96 (42.9 %) | 89 (43.0 %) | 11 (4.0 %) | 124 (62.6 %) | 533 (41.0 %) |
| hs_break_cer_Ter | |||||||
| (0,1.1] | 18 (8.9 %) | 65 (32.8 %) | 61 (27.2 %) | 38 (18.4 %) | 57 (21.0 %) | 52 (26.3 %) | 291 (22.4 %) |
| (1.1,5.5] | 55 (27.2 %) | 67 (33.8 %) | 89 (39.7 %) | 101 (48.8 %) | 114 (41.9 %) | 95 (48.0 %) | 521 (40.0 %) |
| (5.5,Inf] | 129 (63.9 %) | 66 (33.3 %) | 74 (33.0 %) | 68 (32.9 %) | 101 (37.1 %) | 51 (25.8 %) | 489 (37.6 %) |
| hs_dairy_Ter | |||||||
| (0,14.6] | 21 (10.4 %) | 41 (20.7 %) | 55 (24.6 %) | 128 (61.8 %) | 76 (27.9 %) | 38 (19.2 %) | 359 (27.6 %) |
| (14.6,25.6] | 86 (42.6 %) | 49 (24.7 %) | 99 (44.2 %) | 51 (24.6 %) | 91 (33.5 %) | 89 (44.9 %) | 465 (35.7 %) |
| (25.6,Inf] | 95 (47.0 %) | 108 (54.5 %) | 70 (31.2 %) | 28 (13.5 %) | 105 (38.6 %) | 71 (35.9 %) | 477 (36.7 %) |
| hs_fastfood_Ter | |||||||
| (0,0.132] | 18 (8.9 %) | 23 (11.6 %) | 18 (8.0 %) | 51 (24.6 %) | 24 (8.8 %) | 9 (4.5 %) | 143 (11.0 %) |
| (0.132,0.5] | 40 (19.8 %) | 101 (51.0 %) | 127 (56.7 %) | 106 (51.2 %) | 169 (62.1 %) | 60 (30.3 %) | 603 (46.3 %) |
| (0.5,Inf] | 144 (71.3 %) | 74 (37.4 %) | 79 (35.3 %) | 50 (24.2 %) | 79 (29.0 %) | 129 (65.2 %) | 555 (42.7 %) |
| hs_org_food_Ter | |||||||
| (0,0.132] | 114 (56.4 %) | 51 (25.8 %) | 118 (52.7 %) | 19 (9.2 %) | 9 (3.3 %) | 118 (59.6 %) | 429 (33.0 %) |
| (0.132,1] | 40 (19.8 %) | 73 (36.9 %) | 70 (31.2 %) | 75 (36.2 %) | 109 (40.1 %) | 29 (14.6 %) | 396 (30.4 %) |
| (1,Inf] | 48 (23.8 %) | 74 (37.4 %) | 36 (16.1 %) | 113 (54.6 %) | 154 (56.6 %) | 51 (25.8 %) | 476 (36.6 %) |
| hs_proc_meat_Ter | |||||||
| (0,1.5] | 118 (58.4 %) | 47 (23.7 %) | 25 (11.2 %) | 83 (40.1 %) | 39 (14.3 %) | 54 (27.3 %) | 366 (28.1 %) |
| (1.5,4] | 32 (15.8 %) | 90 (45.5 %) | 85 (37.9 %) | 71 (34.3 %) | 85 (31.2 %) | 108 (54.5 %) | 471 (36.2 %) |
| (4,Inf] | 52 (25.7 %) | 61 (30.8 %) | 114 (50.9 %) | 53 (25.6 %) | 148 (54.4 %) | 36 (18.2 %) | 464 (35.7 %) |
| hs_total_fish_Ter | |||||||
| (0,1.5] | 82 (40.6 %) | 38 (19.2 %) | 25 (11.2 %) | 130 (62.8 %) | 38 (14.0 %) | 76 (38.4 %) | 389 (29.9 %) |
| (1.5,3] | 53 (26.2 %) | 103 (52.0 %) | 47 (21.0 %) | 57 (27.5 %) | 94 (34.6 %) | 100 (50.5 %) | 454 (34.9 %) |
| (3,Inf] | 67 (33.2 %) | 57 (28.8 %) | 152 (67.9 %) | 20 (9.7 %) | 140 (51.5 %) | 22 (11.1 %) | 458 (35.2 %) |
| hs_total_fruits_Ter | |||||||
| (0,7] | 26 (12.9 %) | 107 (54.0 %) | 83 (37.1 %) | 99 (47.8 %) | 35 (12.9 %) | 63 (31.8 %) | 413 (31.7 %) |
| (7,14.1] | 42 (20.8 %) | 45 (22.7 %) | 85 (37.9 %) | 64 (30.9 %) | 82 (30.1 %) | 89 (44.9 %) | 407 (31.3 %) |
| (14.1,Inf] | 134 (66.3 %) | 46 (23.2 %) | 56 (25.0 %) | 44 (21.3 %) | 155 (57.0 %) | 46 (23.2 %) | 481 (37.0 %) |
| hs_total_lipids_Ter | |||||||
| (0,3] | 18 (8.9 %) | 31 (15.7 %) | 151 (67.4 %) | 24 (11.6 %) | 32 (11.8 %) | 141 (71.2 %) | 397 (30.5 %) |
| (3,7] | 72 (35.6 %) | 90 (45.5 %) | 40 (17.9 %) | 74 (35.7 %) | 82 (30.1 %) | 45 (22.7 %) | 403 (31.0 %) |
| (7,Inf] | 112 (55.4 %) | 77 (38.9 %) | 33 (14.7 %) | 109 (52.7 %) | 158 (58.1 %) | 12 (6.1 %) | 501 (38.5 %) |
| hs_total_sweets_Ter | |||||||
| (0,4.1] | 50 (24.8 %) | 39 (19.7 %) | 93 (41.5 %) | 19 (9.2 %) | 89 (32.7 %) | 54 (27.3 %) | 344 (26.4 %) |
| (4.1,8.5] | 77 (38.1 %) | 61 (30.8 %) | 88 (39.3 %) | 58 (28.0 %) | 125 (46.0 %) | 107 (54.0 %) | 516 (39.7 %) |
| (8.5,Inf] | 75 (37.1 %) | 98 (49.5 %) | 43 (19.2 %) | 130 (62.8 %) | 58 (21.3 %) | 37 (18.7 %) | 441 (33.9 %) |
| hs_total_veg_Ter | |||||||
| (0,6] | 65 (32.2 %) | 53 (26.8 %) | 94 (42.0 %) | 81 (39.1 %) | 42 (15.4 %) | 69 (34.8 %) | 404 (31.1 %) |
| (6,8.5] | 41 (20.3 %) | 42 (21.2 %) | 69 (30.8 %) | 53 (25.6 %) | 57 (21.0 %) | 52 (26.3 %) | 314 (24.1 %) |
| (8.5,Inf] | 96 (47.5 %) | 103 (52.0 %) | 61 (27.2 %) | 73 (35.3 %) | 173 (63.6 %) | 77 (38.9 %) | 583 (44.8 %) |
| hs_cd_c_Log2 | -3.87 (0.84) | -4.06 (1.22) | -4.22 (1.23) | -4.16 (1.11) | -3.60 (0.74) | -3.99 (0.91) | -3.97 (1.04) |
| hs_co_c_Log2 | -2.31 (0.52) | -2.38 (0.56) | -2.46 (0.64) | -2.37 (0.64) | -2.53 (0.64) | -1.93 (0.56) | -2.34 (0.63) |
| hs_cs_c_Log2 | 0.12 (0.45) | 1.01 (0.47) | 0.61 (0.45) | -0.17 (0.39) | 0.71 (0.40) | 0.29 (0.39) | 0.44 (0.57) |
| hs_cu_c_Log2 | 9.86 (0.23) | 9.88 (0.25) | 9.83 (0.20) | 9.80 (0.21) | 9.71 (0.21) | 9.93 (0.21) | 9.83 (0.23) |
| hs_hg_c_Log2 | -0.56 (1.59) | 0.67 (1.29) | 0.92 (1.30) | -1.97 (1.49) | -0.34 (1.06) | -0.57 (1.69) | -0.30 (1.68) |
| hs_mo_c_Log2 | -0.13 (0.79) | -0.58 (1.18) | -0.55 (0.77) | -0.42 (0.84) | -0.17 (0.74) | -0.07 (0.95) | -0.32 (0.90) |
| hs_dde_cadj_Log2 | 3.81 (1.31) | 4.01 (1.28) | 4.36 (1.24) | 5.67 (1.29) | 4.26 (0.94) | 6.06 (1.41) | 4.67 (1.49) |
| hs_pcb153_cadj_Log2 | 2.73 (0.63) | 3.50 (0.76) | 3.66 (0.84) | 3.93 (0.85) | 4.22 (0.69) | 3.03 (0.68) | 3.56 (0.90) |
| hs_pcb170_cadj_Log2 | -2.44 (3.33) | 0.33 (1.89) | 0.41 (2.42) | -0.81 (3.58) | 1.38 (1.63) | -1.38 (3.14) | -0.31 (3.00) |
| hs_dep_cadj_Log2 | 1.44 (3.30) | -0.27 (3.31) | -0.15 (3.07) | -1.42 (3.25) | 0.62 (2.85) | 0.66 (2.82) | 0.16 (3.21) |
| hs_pbde153_cadj_Log2 | -3.39 (3.79) | -5.11 (3.61) | -5.05 (3.83) | -4.86 (3.78) | -2.66 (3.00) | -6.71 (3.67) | -4.53 (3.83) |
| hs_pfhxs_c_Log2 | -1.48 (1.03) | -0.51 (0.83) | -1.55 (0.88) | -2.69 (1.19) | -0.66 (0.76) | -2.83 (1.08) | -1.57 (1.31) |
| hs_pfoa_c_Log2 | 0.86 (0.50) | 0.56 (0.53) | 0.52 (0.51) | 0.42 (0.61) | 0.80 (0.43) | 0.46 (0.61) | 0.61 (0.55) |
| hs_pfos_c_Log2 | 0.57 (0.90) | 1.64 (0.78) | 0.43 (0.97) | 0.19 (1.29) | 1.67 (0.75) | 1.16 (0.88) | 0.97 (1.11) |
| hs_prpa_cadj_Log2 | -0.05 (3.69) | -2.65 (3.49) | 0.69 (3.83) | -2.00 (3.98) | -3.14 (2.92) | -2.22 (3.50) | -1.61 (3.82) |
| hs_mbzp_cadj_Log2 | 1.60 (1.16) | 2.81 (1.19) | 2.52 (1.09) | 2.81 (1.11) | 2.17 (1.11) | 2.85 (1.23) | 2.44 (1.22) |
| hs_mibp_cadj_Log2 | 6.07 (1.02) | 5.47 (1.07) | 4.88 (0.90) | 6.27 (0.87) | 4.74 (0.99) | 5.63 (0.83) | 5.46 (1.11) |
| hs_mnbp_cadj_Log2 | 4.74 (0.90) | 4.24 (0.86) | 3.99 (0.79) | 5.47 (0.86) | 4.79 (0.89) | 4.84 (1.12) | 4.68 (1.02) |
outcome_cov <- cbind(covariate_data, outcome_BMI)
outcome_cov <- outcome_cov[, !duplicated(colnames(outcome_cov))]
#the full chemicals list
chemicals_full <- c(
"hs_as_c_Log2",
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mn_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_tl_cdich_None",
"hs_dde_cadj_Log2",
"hs_ddt_cadj_Log2",
"hs_hcb_cadj_Log2",
"hs_pcb118_cadj_Log2",
"hs_pcb138_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_pcb180_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_detp_cadj_Log2",
"hs_dmdtp_cdich_None",
"hs_dmp_cadj_Log2",
"hs_dmtp_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pbde47_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfna_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_pfunda_c_Log2",
"hs_bpa_cadj_Log2",
"hs_bupa_cadj_Log2",
"hs_etpa_cadj_Log2",
"hs_mepa_cadj_Log2",
"hs_oxbe_cadj_Log2",
"hs_prpa_cadj_Log2",
"hs_trcs_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mecpp_cadj_Log2",
"hs_mehhp_cadj_Log2",
"hs_mehp_cadj_Log2",
"hs_meohp_cadj_Log2",
"hs_mep_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2",
"hs_ohminp_cadj_Log2",
"hs_oxominp_cadj_Log2",
"hs_cotinine_cdich_None",
"hs_globalexp2_None"
)
#postnatal diet for child
postnatal_diet <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_beverages_Ter",
"hs_break_cer_Ter",
"hs_caff_drink_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_readymade_Ter",
"hs_total_bread_Ter",
"hs_total_cereal_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_meat_Ter",
"hs_total_potatoes_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter",
"hs_total_yog_Ter"
)
chemicals_columns <- c(chemicals_full)
all_chemicals <- exposome %>% dplyr::select(all_of(chemicals_columns))
diet_columns <- c(postnatal_diet)
all_diet <- exposome %>% dplyr::select(all_of(diet_columns))
all_columns <- c(chemicals_full, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))
chemicals_outcome_cov <- cbind(outcome_cov, all_chemicals)
diet_outcome_cov <- cbind(outcome_cov, all_diet)
interested_data <- cbind(outcome_cov, extracted_exposome)
head(interested_data)
interested_data_corr <- select_if(interested_data, is.numeric)
cor_matrix <- cor(interested_data_corr, method = "pearson")
cor_matrix <- cor(interested_data_corr, method = "spearman")
cor_matrix <- cor(interested_data_corr, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 90, tl.cex = 0.4)
covariates_columns <- colnames(covariate_data)
covariates_data <- interested_data %>% dplyr::select(all_of(covariates_columns))
y <- interested_data$hs_zbmi_who
set.seed(101)
trainIndex <- createDataPartition(y, p = .7, list = FALSE, times = 1)
train_data <- interested_data[trainIndex,]
test_data <- interested_data[-trainIndex,]
x_train_cov <- model.matrix(~ . -1, data = covariates_data[trainIndex,])
x_test_cov <- model.matrix(~ . -1, data = covariates_data[-trainIndex,])
y_train <- train_data$hs_zbmi_who
y_test <- test_data$hs_zbmi_who
fit_lasso_cov <- cv.glmnet(x_train_cov, y_train, alpha = 1, family = "gaussian")
plot(fit_lasso_cov, xvar = "lambda", main = "Lasso Coefficients Path")
best_lambda_lasso <- fit_lasso_cov$lambda.min
coef(fit_lasso_cov, s = best_lambda_lasso)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.238040798
## e3_sex_Nonefemale -0.006897611
## e3_sex_Nonemale .
## e3_yearbir_None2004 -0.068854144
## e3_yearbir_None2005 -0.038686120
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 0.017116731
## e3_yearbir_None2009 0.198910468
## h_cohort2 .
## h_cohort3 0.472208207
## h_cohort4 0.314382789
## h_cohort5 -0.049357036
## h_cohort6 0.387237572
## hs_child_age_None .
lasso_predictions_cov <- predict(fit_lasso_cov, s = best_lambda_lasso, newx = x_test_cov)
mse_lasso_cov <- mean((y_test - lasso_predictions_cov)^2)
rmse_lasso_cov <- sqrt(mse_lasso_cov)
cat("Lasso Test MSE (Covariates only):", mse_lasso_cov, "\n")
## Lasso Test MSE (Covariates only): 1.239095
cat("Lasso Test RMSE (Covariates only):", rmse_lasso_cov, "\n")
## Lasso Test RMSE (Covariates only): 1.113147
fit_ridge_cov <- cv.glmnet(x_train_cov, y_train, alpha = 0, family = "gaussian")
plot(fit_ridge_cov, xvar = "lambda", main = "Ridge Coefficients Path")
best_lambda_ridge <- fit_ridge_cov$lambda.min
coef(fit_ridge_cov, s = best_lambda_ridge)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.09319795
## e3_sex_Nonefemale -0.01865528
## e3_sex_Nonemale 0.01853630
## e3_yearbir_None2004 -0.16725390
## e3_yearbir_None2005 -0.12077752
## e3_yearbir_None2006 -0.05083264
## e3_yearbir_None2007 -0.03585817
## e3_yearbir_None2008 0.02669345
## e3_yearbir_None2009 0.33967684
## h_cohort2 -0.09907039
## h_cohort3 0.43630080
## h_cohort4 0.28804145
## h_cohort5 -0.10836528
## h_cohort6 0.36307763
## hs_child_age_None 0.02739312
ridge_predictions_cov <- predict(fit_ridge_cov, s = best_lambda_ridge, newx = x_test_cov)
mse_ridge_cov <- mean((y_test - ridge_predictions_cov)^2)
rmse_ridge_cov <- sqrt(mse_ridge_cov)
cat("Ridge Test MSE (Covariates only):", mse_ridge_cov, "\n")
## Ridge Test MSE (Covariates only): 1.236435
cat("Ridge Test RMSE (Covariates only):", rmse_ridge_cov, "\n")
## Ridge Test RMSE (Covariates only): 1.111951
fit_enet_cov <- cv.glmnet(x_train_cov, y_train, alpha = 0.5, family = "gaussian")
plot(fit_enet_cov, xvar = "lambda", main = "Elastic Net Coefficients Path")
best_lambda_enet <- fit_enet_cov$lambda.min
coef(fit_enet_cov, s = best_lambda_enet)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.243795099
## e3_sex_Nonefemale -0.008612618
## e3_sex_Nonemale 0.001147480
## e3_yearbir_None2004 -0.078883364
## e3_yearbir_None2005 -0.041622954
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 0.019281438
## e3_yearbir_None2009 0.213563502
## h_cohort2 .
## h_cohort3 0.466431092
## h_cohort4 0.306885448
## h_cohort5 -0.056625036
## h_cohort6 0.379711766
## hs_child_age_None .
enet_predictions_cov <- predict(fit_enet_cov, s = best_lambda_enet, newx = x_test_cov)
mse_enet_cov <- mean((y_test - enet_predictions_cov)^2)
rmse_enet_cov <- sqrt(mse_enet_cov)
cat("Elastic Net Test MSE (Covariates only):", mse_enet_cov, "\n")
## Elastic Net Test MSE (Covariates only): 1.238834
cat("Elastic Net Test RMSE (Covariates only):", rmse_enet_cov, "\n")
## Elastic Net Test RMSE (Covariates only): 1.113029
#LASSO train/test 70-30
set.seed(101)
train_indices <- sample(seq_len(nrow(chemicals_outcome_cov)), size = floor(0.7 * nrow(interested_data)))
test_indices <- setdiff(seq_len(nrow(chemicals_outcome_cov)), train_indices)
x_train <- as.matrix(chemicals_outcome_cov[train_indices, setdiff(names(chemicals_outcome_cov), "hs_zbmi_who")])
y_train <- chemicals_outcome_cov$hs_zbmi_who[train_indices]
x_test <- as.matrix(chemicals_outcome_cov[test_indices, setdiff(names(chemicals_outcome_cov), "hs_zbmi_who")])
y_test <- chemicals_outcome_cov$hs_zbmi_who[test_indices]
x_train_chemicals_only <- as.matrix(chemicals_outcome_cov[train_indices, chemicals_full])
x_test_chemicals_only <- as.matrix(chemicals_outcome_cov[test_indices, chemicals_full])
fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 1, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)
plot(fit_without_covariates_train, xvar = "lambda", main = "Coefficients Path (Without Covariates)")
best_lambda <- fit_without_covariates_train$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 50 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -4.7797230131
## hs_as_c_Log2 .
## hs_cd_c_Log2 -0.0238815730
## hs_co_c_Log2 -0.0011670319
## hs_cs_c_Log2 0.0771865955
## hs_cu_c_Log2 0.6071183261
## hs_hg_c_Log2 -0.0075730086
## hs_mn_c_Log2 .
## hs_mo_c_Log2 -0.0992489424
## hs_pb_c_Log2 -0.0056257448
## hs_tl_cdich_None .
## hs_dde_cadj_Log2 -0.0378984008
## hs_ddt_cadj_Log2 .
## hs_hcb_cadj_Log2 .
## hs_pcb118_cadj_Log2 .
## hs_pcb138_cadj_Log2 .
## hs_pcb153_cadj_Log2 -0.1721262187
## hs_pcb170_cadj_Log2 -0.0557570999
## hs_pcb180_cadj_Log2 .
## hs_dep_cadj_Log2 -0.0186165147
## hs_detp_cadj_Log2 .
## hs_dmdtp_cdich_None .
## hs_dmp_cadj_Log2 .
## hs_dmtp_cadj_Log2 .
## hs_pbde153_cadj_Log2 -0.0357794002
## hs_pbde47_cadj_Log2 .
## hs_pfhxs_c_Log2 -0.0019079468
## hs_pfna_c_Log2 .
## hs_pfoa_c_Log2 -0.1360824261
## hs_pfos_c_Log2 -0.0478302901
## hs_pfunda_c_Log2 .
## hs_bpa_cadj_Log2 .
## hs_bupa_cadj_Log2 .
## hs_etpa_cadj_Log2 .
## hs_mepa_cadj_Log2 .
## hs_oxbe_cadj_Log2 0.0008622765
## hs_prpa_cadj_Log2 0.0011728557
## hs_trcs_cadj_Log2 .
## hs_mbzp_cadj_Log2 0.0373221816
## hs_mecpp_cadj_Log2 .
## hs_mehhp_cadj_Log2 .
## hs_mehp_cadj_Log2 .
## hs_meohp_cadj_Log2 .
## hs_mep_cadj_Log2 .
## hs_mibp_cadj_Log2 -0.0477304169
## hs_mnbp_cadj_Log2 -0.0036235331
## hs_ohminp_cadj_Log2 .
## hs_oxominp_cadj_Log2 .
## hs_cotinine_cdich_None .
## hs_globalexp2_None .
cat("Model without Covariates - Test MSE:", test_mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.231997
# RIDGE
fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 0, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)
plot(fit_without_covariates_train, xvar = "lambda", main = "Coefficients Path (Without Covariates)")
best_lambda <- fit_without_covariates_train$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 50 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -4.469806e+00
## hs_as_c_Log2 6.590433e-03
## hs_cd_c_Log2 -4.093355e-02
## hs_co_c_Log2 -5.049922e-02
## hs_cs_c_Log2 1.230373e-01
## hs_cu_c_Log2 6.078479e-01
## hs_hg_c_Log2 -3.225520e-02
## hs_mn_c_Log2 -3.089195e-02
## hs_mo_c_Log2 -1.068154e-01
## hs_pb_c_Log2 -5.295956e-02
## hs_tl_cdich_None .
## hs_dde_cadj_Log2 -4.888006e-02
## hs_ddt_cadj_Log2 4.045085e-03
## hs_hcb_cadj_Log2 -1.857150e-02
## hs_pcb118_cadj_Log2 1.400112e-02
## hs_pcb138_cadj_Log2 -3.614513e-02
## hs_pcb153_cadj_Log2 -1.223407e-01
## hs_pcb170_cadj_Log2 -5.267521e-02
## hs_pcb180_cadj_Log2 -1.074695e-02
## hs_dep_cadj_Log2 -2.548881e-02
## hs_detp_cadj_Log2 8.051621e-03
## hs_dmdtp_cdich_None .
## hs_dmp_cadj_Log2 -2.097690e-03
## hs_dmtp_cadj_Log2 7.300567e-05
## hs_pbde153_cadj_Log2 -3.315313e-02
## hs_pbde47_cadj_Log2 5.273953e-03
## hs_pfhxs_c_Log2 -2.966308e-02
## hs_pfna_c_Log2 2.336166e-02
## hs_pfoa_c_Log2 -1.519872e-01
## hs_pfos_c_Log2 -6.495855e-02
## hs_pfunda_c_Log2 1.248503e-02
## hs_bpa_cadj_Log2 3.832688e-04
## hs_bupa_cadj_Log2 6.588467e-03
## hs_etpa_cadj_Log2 -6.098679e-03
## hs_mepa_cadj_Log2 -1.638466e-02
## hs_oxbe_cadj_Log2 1.390524e-02
## hs_prpa_cadj_Log2 1.258510e-02
## hs_trcs_cadj_Log2 2.878805e-03
## hs_mbzp_cadj_Log2 5.550048e-02
## hs_mecpp_cadj_Log2 1.627174e-03
## hs_mehhp_cadj_Log2 2.316991e-02
## hs_mehp_cadj_Log2 -1.662304e-02
## hs_meohp_cadj_Log2 1.137436e-02
## hs_mep_cadj_Log2 3.371106e-03
## hs_mibp_cadj_Log2 -5.391219e-02
## hs_mnbp_cadj_Log2 -4.383016e-02
## hs_ohminp_cadj_Log2 -2.886768e-02
## hs_oxominp_cadj_Log2 2.204660e-02
## hs_cotinine_cdich_None .
## hs_globalexp2_None .
cat("Model without Covariates - Test MSE:", test_mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.188752
# ELASTIC NET
fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)
plot(fit_without_covariates_train, xvar = "lambda", main = "Coefficients Path (Without Covariates)")
best_lambda <- fit_without_covariates_train$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 50 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -4.785950188
## hs_as_c_Log2 .
## hs_cd_c_Log2 -0.025843356
## hs_co_c_Log2 -0.005835867
## hs_cs_c_Log2 0.084715330
## hs_cu_c_Log2 0.607379616
## hs_hg_c_Log2 -0.009800093
## hs_mn_c_Log2 .
## hs_mo_c_Log2 -0.099724922
## hs_pb_c_Log2 -0.010318890
## hs_tl_cdich_None .
## hs_dde_cadj_Log2 -0.039528137
## hs_ddt_cadj_Log2 .
## hs_hcb_cadj_Log2 .
## hs_pcb118_cadj_Log2 .
## hs_pcb138_cadj_Log2 .
## hs_pcb153_cadj_Log2 -0.169008355
## hs_pcb170_cadj_Log2 -0.055808065
## hs_pcb180_cadj_Log2 .
## hs_dep_cadj_Log2 -0.019034348
## hs_detp_cadj_Log2 .
## hs_dmdtp_cdich_None .
## hs_dmp_cadj_Log2 .
## hs_dmtp_cadj_Log2 .
## hs_pbde153_cadj_Log2 -0.035464586
## hs_pbde47_cadj_Log2 .
## hs_pfhxs_c_Log2 -0.006816020
## hs_pfna_c_Log2 .
## hs_pfoa_c_Log2 -0.135997766
## hs_pfos_c_Log2 -0.047692264
## hs_pfunda_c_Log2 .
## hs_bpa_cadj_Log2 .
## hs_bupa_cadj_Log2 .
## hs_etpa_cadj_Log2 .
## hs_mepa_cadj_Log2 .
## hs_oxbe_cadj_Log2 0.002529961
## hs_prpa_cadj_Log2 0.001735800
## hs_trcs_cadj_Log2 .
## hs_mbzp_cadj_Log2 0.040317847
## hs_mecpp_cadj_Log2 .
## hs_mehhp_cadj_Log2 .
## hs_mehp_cadj_Log2 .
## hs_meohp_cadj_Log2 .
## hs_mep_cadj_Log2 .
## hs_mibp_cadj_Log2 -0.047892677
## hs_mnbp_cadj_Log2 -0.008483913
## hs_ohminp_cadj_Log2 .
## hs_oxominp_cadj_Log2 .
## hs_cotinine_cdich_None .
## hs_globalexp2_None .
cat("Model without Covariates - Test MSE:", test_mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.228805
#selected chemicals that were noted in enet
chemicals_selected <- c(
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_detp_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_mepa_cadj_Log2",
"hs_oxbe_cadj_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2")
The features for chemicals were selected due to the feature selections of elastic net.
# LASSO with train/test
set.seed(101)
train_indices <- sample(seq_len(nrow(diet_outcome_cov)), size = floor(0.7 * nrow(diet_outcome_cov)))
test_indices <- setdiff(seq_len(nrow(diet_outcome_cov)), train_indices)
diet_data <- diet_outcome_cov[, postnatal_diet]
x_diet_train <- model.matrix(~ . + 0, data = diet_data[train_indices, ])
x_diet_test <- model.matrix(~ . + 0, data = diet_data[test_indices, ])
covariates <- diet_outcome_cov[, c("e3_sex_None", "e3_yearbir_None", "h_cohort", "hs_child_age_None")]
x_covariates_train <- model.matrix(~ . + 0, data = covariates[train_indices, ])
x_covariates_test <- model.matrix(~ . + 0, data = covariates[test_indices, ])
x_full_train <- cbind(x_diet_train, x_covariates_train)
x_full_test <- cbind(x_diet_test, x_covariates_test)
x_full_train[is.na(x_full_train)] <- 0
x_full_test[is.na(x_full_test)] <- 0
x_diet_train[is.na(x_diet_train)] <- 0
x_diet_test[is.na(x_diet_test)] <- 0
y_train <- as.numeric(diet_outcome_cov$hs_zbmi_who[train_indices])
y_test <- as.numeric(diet_outcome_cov$hs_zbmi_who[test_indices])
# fit models
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 1, family = "gaussian")
fit_without_covariates
##
## Call: cv.glmnet(x = x_diet_train, y = y_train, alpha = 1, family = "gaussian")
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.06922 9 1.431 0.06022 5
## 1se 0.14570 1 1.442 0.06160 0
plot(fit_without_covariates, xvar = "lambda", main = "Coefficient Path (Without Covariates)")
best_lambda <- fit_without_covariates$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.53256344
## h_bfdur_Ter(0,10.8] .
## h_bfdur_Ter(10.8,34.9] .
## h_bfdur_Ter(34.9,Inf] .
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] .
## hs_beverages_Ter(0.132,1] .
## hs_beverages_Ter(1,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] .
## hs_caff_drink_Ter(0.132,Inf] .
## hs_dairy_Ter(14.6,25.6] .
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] .
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] -0.13588632
## hs_proc_meat_Ter(1.5,4] .
## hs_proc_meat_Ter(4,Inf] .
## hs_readymade_Ter(0.132,0.5] .
## hs_readymade_Ter(0.5,Inf] .
## hs_total_bread_Ter(7,17.5] .
## hs_total_bread_Ter(17.5,Inf] .
## hs_total_cereal_Ter(14.1,23.6] .
## hs_total_cereal_Ter(23.6,Inf] .
## hs_total_fish_Ter(1.5,3] .
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] .
## hs_total_fruits_Ter(14.1,Inf] -0.02481964
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] -0.05164312
## hs_total_meat_Ter(6,9] .
## hs_total_meat_Ter(9,Inf] .
## hs_total_potatoes_Ter(3,4] .
## hs_total_potatoes_Ter(4,Inf] .
## hs_total_sweets_Ter(4.1,8.5] -0.01594403
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] -0.08180563
## hs_total_yog_Ter(6,8.5] .
## hs_total_yog_Ter(8.5,Inf] .
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.34942
# RIDGE
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 0, family = "gaussian")
fit_without_covariates
##
## Call: cv.glmnet(x = x_diet_train, y = y_train, alpha = 0, family = "gaussian")
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 3.53 41 1.431 0.08497 40
## 1se 145.70 1 1.441 0.08233 40
plot(fit_without_covariates, xvar = "lambda", main = "Coefficient Path (Without Covariates)")
best_lambda <- fit_without_covariates$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.5163069457
## h_bfdur_Ter(0,10.8] -0.0114164662
## h_bfdur_Ter(10.8,34.9] 0.0353770607
## h_bfdur_Ter(34.9,Inf] -0.0138651651
## hs_bakery_prod_Ter(2,6] 0.0228606785
## hs_bakery_prod_Ter(6,Inf] -0.0268639952
## hs_beverages_Ter(0.132,1] -0.0065939314
## hs_beverages_Ter(1,Inf] -0.0016124215
## hs_break_cer_Ter(1.1,5.5] -0.0034207548
## hs_break_cer_Ter(5.5,Inf] -0.0337182186
## hs_caff_drink_Ter(0.132,Inf] -0.0143879393
## hs_dairy_Ter(14.6,25.6] 0.0355023507
## hs_dairy_Ter(25.6,Inf] -0.0005581647
## hs_fastfood_Ter(0.132,0.5] 0.0161761119
## hs_fastfood_Ter(0.5,Inf] -0.0001750742
## hs_org_food_Ter(0.132,1] 0.0151677373
## hs_org_food_Ter(1,Inf] -0.0682466785
## hs_proc_meat_Ter(1.5,4] 0.0222199344
## hs_proc_meat_Ter(4,Inf] -0.0187135643
## hs_readymade_Ter(0.132,0.5] -0.0013536008
## hs_readymade_Ter(0.5,Inf] 0.0105115509
## hs_total_bread_Ter(7,17.5] -0.0035702530
## hs_total_bread_Ter(17.5,Inf] -0.0070550360
## hs_total_cereal_Ter(14.1,23.6] 0.0082269928
## hs_total_cereal_Ter(23.6,Inf] -0.0131001584
## hs_total_fish_Ter(1.5,3] -0.0346609367
## hs_total_fish_Ter(3,Inf] -0.0051749487
## hs_total_fruits_Ter(7,14.1] 0.0266413533
## hs_total_fruits_Ter(14.1,Inf] -0.0389551124
## hs_total_lipids_Ter(3,7] -0.0022752284
## hs_total_lipids_Ter(7,Inf] -0.0476627593
## hs_total_meat_Ter(6,9] 0.0007524275
## hs_total_meat_Ter(9,Inf] 0.0005196923
## hs_total_potatoes_Ter(3,4] 0.0105526823
## hs_total_potatoes_Ter(4,Inf] 0.0048180175
## hs_total_sweets_Ter(4.1,8.5] -0.0392140671
## hs_total_sweets_Ter(8.5,Inf] -0.0010028529
## hs_total_veg_Ter(6,8.5] 0.0009962184
## hs_total_veg_Ter(8.5,Inf] -0.0556956882
## hs_total_yog_Ter(6,8.5] -0.0102351610
## hs_total_yog_Ter(8.5,Inf] -0.0089303177
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.326308
#ELASTIC NET
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates
##
## Call: cv.glmnet(x = x_diet_train, y = y_train, alpha = 0.5, family = "gaussian")
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.07218 16 1.430 0.05641 12
## 1se 0.29139 1 1.444 0.05877 0
plot(fit_without_covariates, xvar = "lambda", main = "Coefficient Path (Without Covariates)")
best_lambda <- fit_without_covariates$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.650606526
## h_bfdur_Ter(0,10.8] .
## h_bfdur_Ter(10.8,34.9] 0.039832328
## h_bfdur_Ter(34.9,Inf] .
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] -0.052635590
## hs_beverages_Ter(0.132,1] .
## hs_beverages_Ter(1,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] -0.054788470
## hs_caff_drink_Ter(0.132,Inf] .
## hs_dairy_Ter(14.6,25.6] 0.053455833
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] .
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] -0.185235916
## hs_proc_meat_Ter(1.5,4] 0.008558872
## hs_proc_meat_Ter(4,Inf] .
## hs_readymade_Ter(0.132,0.5] .
## hs_readymade_Ter(0.5,Inf] .
## hs_total_bread_Ter(7,17.5] .
## hs_total_bread_Ter(17.5,Inf] .
## hs_total_cereal_Ter(14.1,23.6] .
## hs_total_cereal_Ter(23.6,Inf] .
## hs_total_fish_Ter(1.5,3] -0.057540803
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] 0.017171763
## hs_total_fruits_Ter(14.1,Inf] -0.054914989
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] -0.094342286
## hs_total_meat_Ter(6,9] .
## hs_total_meat_Ter(9,Inf] .
## hs_total_potatoes_Ter(3,4] .
## hs_total_potatoes_Ter(4,Inf] .
## hs_total_sweets_Ter(4.1,8.5] -0.089860153
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] -0.118161721
## hs_total_yog_Ter(6,8.5] .
## hs_total_yog_Ter(8.5,Inf] .
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.335144
#selected diets that were noted in enet
diet_selected <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
set.seed(101)
train_indices <- sample(seq_len(nrow(interested_data)), size = floor(0.7 * nrow(interested_data)))
test_indices <- setdiff(seq_len(nrow(interested_data)), train_indices)
diet_data <- interested_data[, postnatal_diet]
x_diet_train <- model.matrix(~ . + 0, data = diet_data[train_indices, ])
x_diet_test <- model.matrix(~ . + 0, data = diet_data[test_indices, ])
chemical_data <- interested_data[, chemicals_full]
x_chemical_train <- as.matrix(chemical_data[train_indices, ])
x_chemical_test <- as.matrix(chemical_data[test_indices, ])
covariates <- interested_data[, c("e3_sex_None", "e3_yearbir_None", "h_cohort", "hs_child_age_None")]
x_covariates_train <- model.matrix(~ . + 0, data = covariates[train_indices, ])
x_covariates_test <- model.matrix(~ . + 0, data = covariates[test_indices, ])
# combine diet and chemical data with and without covariates
x_combined_train <- cbind(x_diet_train, x_chemical_train)
x_combined_test <- cbind(x_diet_test, x_chemical_test)
x_full_train <- cbind(x_combined_train, x_covariates_train)
x_full_test <- cbind(x_combined_test, x_covariates_test)
# make sure no missing values
x_full_train[is.na(x_full_train)] <- 0
x_full_test[is.na(x_full_test)] <- 0
x_combined_train[is.na(x_combined_train)] <- 0
x_combined_test[is.na(x_combined_test)] <- 0
y_train <- as.numeric(interested_data$hs_zbmi_who[train_indices])
y_test <- as.numeric(interested_data$hs_zbmi_who[test_indices])
# LASSO
fit_without_covariates <- cv.glmnet(x_combined_train, y_train, alpha = 1, family = "gaussian")
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_combined_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)
plot(fit_without_covariates, xvar = "lambda", main = "Coefficient Path (Without Covariates)")
best_lambda <- fit_without_covariates$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 90 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -5.016149911
## h_bfdur_Ter(0,10.8] -0.129594522
## h_bfdur_Ter(10.8,34.9] .
## h_bfdur_Ter(34.9,Inf] .
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] -0.217291423
## hs_beverages_Ter(0.132,1] .
## hs_beverages_Ter(1,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] .
## hs_caff_drink_Ter(0.132,Inf] .
## hs_dairy_Ter(14.6,25.6] 0.009808165
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] 0.070972556
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] .
## hs_proc_meat_Ter(1.5,4] .
## hs_proc_meat_Ter(4,Inf] .
## hs_readymade_Ter(0.132,0.5] .
## hs_readymade_Ter(0.5,Inf] 0.011160944
## hs_total_bread_Ter(7,17.5] -0.010168208
## hs_total_bread_Ter(17.5,Inf] .
## hs_total_cereal_Ter(14.1,23.6] .
## hs_total_cereal_Ter(23.6,Inf] .
## hs_total_fish_Ter(1.5,3] -0.024288530
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] .
## hs_total_fruits_Ter(14.1,Inf] -0.016129393
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] -0.047350302
## hs_total_meat_Ter(6,9] .
## hs_total_meat_Ter(9,Inf] .
## hs_total_potatoes_Ter(3,4] 0.018317955
## hs_total_potatoes_Ter(4,Inf] .
## hs_total_sweets_Ter(4.1,8.5] -0.006515994
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] -0.041036632
## hs_total_yog_Ter(6,8.5] .
## hs_total_yog_Ter(8.5,Inf] .
## hs_as_c_Log2 .
## hs_cd_c_Log2 -0.022337287
## hs_co_c_Log2 -0.003616434
## hs_cs_c_Log2 0.070483114
## hs_cu_c_Log2 0.656568320
## hs_hg_c_Log2 -0.012267249
## hs_mn_c_Log2 .
## hs_mo_c_Log2 -0.097496432
## hs_pb_c_Log2 .
## hs_tl_cdich_None .
## hs_dde_cadj_Log2 -0.029771276
## hs_ddt_cadj_Log2 .
## hs_hcb_cadj_Log2 .
## hs_pcb118_cadj_Log2 .
## hs_pcb138_cadj_Log2 .
## hs_pcb153_cadj_Log2 -0.226942147
## hs_pcb170_cadj_Log2 -0.054403335
## hs_pcb180_cadj_Log2 .
## hs_dep_cadj_Log2 -0.017878387
## hs_detp_cadj_Log2 .
## hs_dmdtp_cdich_None .
## hs_dmp_cadj_Log2 .
## hs_dmtp_cadj_Log2 .
## hs_pbde153_cadj_Log2 -0.035568595
## hs_pbde47_cadj_Log2 .
## hs_pfhxs_c_Log2 .
## hs_pfna_c_Log2 .
## hs_pfoa_c_Log2 -0.125219198
## hs_pfos_c_Log2 -0.047655946
## hs_pfunda_c_Log2 .
## hs_bpa_cadj_Log2 .
## hs_bupa_cadj_Log2 .
## hs_etpa_cadj_Log2 .
## hs_mepa_cadj_Log2 .
## hs_oxbe_cadj_Log2 .
## hs_prpa_cadj_Log2 .
## hs_trcs_cadj_Log2 .
## hs_mbzp_cadj_Log2 0.043689764
## hs_mecpp_cadj_Log2 .
## hs_mehhp_cadj_Log2 .
## hs_mehp_cadj_Log2 .
## hs_meohp_cadj_Log2 .
## hs_mep_cadj_Log2 .
## hs_mibp_cadj_Log2 -0.040902710
## hs_mnbp_cadj_Log2 -0.007173325
## hs_ohminp_cadj_Log2 .
## hs_oxominp_cadj_Log2 .
## hs_cotinine_cdich_None .
## hs_globalexp2_None .
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.200253
# RIDGE
fit_without_covariates <- cv.glmnet(x_combined_train, y_train, alpha = 0, family = "gaussian")
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_combined_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)
plot(fit_without_covariates, xvar = "lambda", main = "Coefficient Path (Without Covariates)")
best_lambda <- fit_without_covariates$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 90 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -3.7486876482
## h_bfdur_Ter(0,10.8] -0.0862270481
## h_bfdur_Ter(10.8,34.9] 0.0187498222
## h_bfdur_Ter(34.9,Inf] 0.0718972907
## hs_bakery_prod_Ter(2,6] -0.0033853186
## hs_bakery_prod_Ter(6,Inf] -0.1580980396
## hs_beverages_Ter(0.132,1] 0.0052318976
## hs_beverages_Ter(1,Inf] -0.0339118523
## hs_break_cer_Ter(1.1,5.5] 0.0042988311
## hs_break_cer_Ter(5.5,Inf] -0.0503391950
## hs_caff_drink_Ter(0.132,Inf] 0.0156001183
## hs_dairy_Ter(14.6,25.6] 0.0416574408
## hs_dairy_Ter(25.6,Inf] -0.0174860568
## hs_fastfood_Ter(0.132,0.5] 0.0650667870
## hs_fastfood_Ter(0.5,Inf] -0.0300919849
## hs_org_food_Ter(0.132,1] 0.0284491409
## hs_org_food_Ter(1,Inf] -0.0490021669
## hs_proc_meat_Ter(1.5,4] 0.0055207383
## hs_proc_meat_Ter(4,Inf] -0.0063080789
## hs_readymade_Ter(0.132,0.5] 0.0307292842
## hs_readymade_Ter(0.5,Inf] 0.0632539981
## hs_total_bread_Ter(7,17.5] -0.0544944827
## hs_total_bread_Ter(17.5,Inf] 0.0146129335
## hs_total_cereal_Ter(14.1,23.6] -0.0004875292
## hs_total_cereal_Ter(23.6,Inf] 0.0180167268
## hs_total_fish_Ter(1.5,3] -0.0683250014
## hs_total_fish_Ter(3,Inf] 0.0112125503
## hs_total_fruits_Ter(7,14.1] 0.0353241028
## hs_total_fruits_Ter(14.1,Inf] -0.0433100932
## hs_total_lipids_Ter(3,7] -0.0171427895
## hs_total_lipids_Ter(7,Inf] -0.0848619938
## hs_total_meat_Ter(6,9] 0.0172861408
## hs_total_meat_Ter(9,Inf] 0.0044053472
## hs_total_potatoes_Ter(3,4] 0.0536415284
## hs_total_potatoes_Ter(4,Inf] -0.0115575388
## hs_total_sweets_Ter(4.1,8.5] -0.0692484887
## hs_total_sweets_Ter(8.5,Inf] -0.0097071229
## hs_total_veg_Ter(6,8.5] 0.0031586461
## hs_total_veg_Ter(8.5,Inf] -0.0567605211
## hs_total_yog_Ter(6,8.5] -0.0245534422
## hs_total_yog_Ter(8.5,Inf] -0.0386998840
## hs_as_c_Log2 0.0050439215
## hs_cd_c_Log2 -0.0352737869
## hs_co_c_Log2 -0.0396473666
## hs_cs_c_Log2 0.0905666600
## hs_cu_c_Log2 0.5291861050
## hs_hg_c_Log2 -0.0253437065
## hs_mn_c_Log2 -0.0187832842
## hs_mo_c_Log2 -0.0835328881
## hs_pb_c_Log2 -0.0275390915
## hs_tl_cdich_None .
## hs_dde_cadj_Log2 -0.0366806354
## hs_ddt_cadj_Log2 0.0032185740
## hs_hcb_cadj_Log2 -0.0317509983
## hs_pcb118_cadj_Log2 0.0025521400
## hs_pcb138_cadj_Log2 -0.0518399321
## hs_pcb153_cadj_Log2 -0.1215197442
## hs_pcb170_cadj_Log2 -0.0418593821
## hs_pcb180_cadj_Log2 -0.0225049584
## hs_dep_cadj_Log2 -0.0189572104
## hs_detp_cadj_Log2 0.0059280868
## hs_dmdtp_cdich_None .
## hs_dmp_cadj_Log2 -0.0024527279
## hs_dmtp_cadj_Log2 0.0008420662
## hs_pbde153_cadj_Log2 -0.0277474044
## hs_pbde47_cadj_Log2 0.0052481134
## hs_pfhxs_c_Log2 -0.0305593699
## hs_pfna_c_Log2 -0.0041077407
## hs_pfoa_c_Log2 -0.1108211867
## hs_pfos_c_Log2 -0.0475012252
## hs_pfunda_c_Log2 0.0072385180
## hs_bpa_cadj_Log2 -0.0063616978
## hs_bupa_cadj_Log2 0.0036910227
## hs_etpa_cadj_Log2 -0.0049963326
## hs_mepa_cadj_Log2 -0.0096168009
## hs_oxbe_cadj_Log2 0.0101184198
## hs_prpa_cadj_Log2 0.0061492375
## hs_trcs_cadj_Log2 0.0062007460
## hs_mbzp_cadj_Log2 0.0427566149
## hs_mecpp_cadj_Log2 0.0064702943
## hs_mehhp_cadj_Log2 0.0137458333
## hs_mehp_cadj_Log2 -0.0049569930
## hs_meohp_cadj_Log2 0.0093327409
## hs_mep_cadj_Log2 0.0064280760
## hs_mibp_cadj_Log2 -0.0385576131
## hs_mnbp_cadj_Log2 -0.0344895032
## hs_ohminp_cadj_Log2 -0.0210558232
## hs_oxominp_cadj_Log2 0.0113725933
## hs_cotinine_cdich_None .
## hs_globalexp2_None .
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.154844
# ELASTIC NET
fit_without_covariates <- cv.glmnet(x_combined_train, y_train, alpha = 0.5, family = "gaussian")
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_combined_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)
plot(fit_without_covariates, xvar = "lambda", main = "Coefficient Path (Without Covariates)")
best_lambda <- fit_without_covariates$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 90 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -5.0446731390
## h_bfdur_Ter(0,10.8] -0.1251508851
## h_bfdur_Ter(10.8,34.9] .
## h_bfdur_Ter(34.9,Inf] 0.0089409662
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] -0.2141179309
## hs_beverages_Ter(0.132,1] .
## hs_beverages_Ter(1,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] .
## hs_caff_drink_Ter(0.132,Inf] .
## hs_dairy_Ter(14.6,25.6] 0.0175630268
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] 0.0739162130
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] -0.0049809964
## hs_proc_meat_Ter(1.5,4] .
## hs_proc_meat_Ter(4,Inf] .
## hs_readymade_Ter(0.132,0.5] .
## hs_readymade_Ter(0.5,Inf] 0.0170253867
## hs_total_bread_Ter(7,17.5] -0.0178078486
## hs_total_bread_Ter(17.5,Inf] .
## hs_total_cereal_Ter(14.1,23.6] .
## hs_total_cereal_Ter(23.6,Inf] .
## hs_total_fish_Ter(1.5,3] -0.0311197485
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] 0.0058224545
## hs_total_fruits_Ter(14.1,Inf] -0.0180115810
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] -0.0529611086
## hs_total_meat_Ter(6,9] .
## hs_total_meat_Ter(9,Inf] .
## hs_total_potatoes_Ter(3,4] 0.0233163117
## hs_total_potatoes_Ter(4,Inf] .
## hs_total_sweets_Ter(4.1,8.5] -0.0128007469
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] -0.0436127333
## hs_total_yog_Ter(6,8.5] .
## hs_total_yog_Ter(8.5,Inf] .
## hs_as_c_Log2 .
## hs_cd_c_Log2 -0.0242086233
## hs_co_c_Log2 -0.0084586207
## hs_cs_c_Log2 0.0772668783
## hs_cu_c_Log2 0.6571106900
## hs_hg_c_Log2 -0.0143619887
## hs_mn_c_Log2 .
## hs_mo_c_Log2 -0.0974389913
## hs_pb_c_Log2 .
## hs_tl_cdich_None .
## hs_dde_cadj_Log2 -0.0321212412
## hs_ddt_cadj_Log2 .
## hs_hcb_cadj_Log2 .
## hs_pcb118_cadj_Log2 .
## hs_pcb138_cadj_Log2 .
## hs_pcb153_cadj_Log2 -0.2221589832
## hs_pcb170_cadj_Log2 -0.0546331904
## hs_pcb180_cadj_Log2 .
## hs_dep_cadj_Log2 -0.0179999867
## hs_detp_cadj_Log2 .
## hs_dmdtp_cdich_None .
## hs_dmp_cadj_Log2 .
## hs_dmtp_cadj_Log2 .
## hs_pbde153_cadj_Log2 -0.0351341084
## hs_pbde47_cadj_Log2 .
## hs_pfhxs_c_Log2 -0.0055363055
## hs_pfna_c_Log2 .
## hs_pfoa_c_Log2 -0.1254532888
## hs_pfos_c_Log2 -0.0469893259
## hs_pfunda_c_Log2 .
## hs_bpa_cadj_Log2 .
## hs_bupa_cadj_Log2 .
## hs_etpa_cadj_Log2 .
## hs_mepa_cadj_Log2 .
## hs_oxbe_cadj_Log2 .
## hs_prpa_cadj_Log2 0.0001965683
## hs_trcs_cadj_Log2 .
## hs_mbzp_cadj_Log2 0.0457827093
## hs_mecpp_cadj_Log2 .
## hs_mehhp_cadj_Log2 .
## hs_mehp_cadj_Log2 .
## hs_meohp_cadj_Log2 .
## hs_mep_cadj_Log2 .
## hs_mibp_cadj_Log2 -0.0415220843
## hs_mnbp_cadj_Log2 -0.0111086286
## hs_ohminp_cadj_Log2 .
## hs_oxominp_cadj_Log2 .
## hs_cotinine_cdich_None .
## hs_globalexp2_None .
cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.198308
Selected data based on the enet features without covariates.
Froze the covariates in lasso, ridge and enet to avoid shrinkage.
#selected chemicals that were noted in enet
chemicals_selected <- c(
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_detp_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_mepa_cadj_Log2",
"hs_oxbe_cadj_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2")
#selected diets that were noted in enet
diet_selected <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
combined_data_selected <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter",
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
outcome_cov <- cbind(covariate_data, outcome_BMI)
outcome_cov <- outcome_cov[, !duplicated(colnames(outcome_cov))]
finalized_columns <- c(combined_data_selected)
final_selected_data <- exposome %>% dplyr::select(all_of(finalized_columns))
finalized_data <- cbind(outcome_cov, final_selected_data)
head(finalized_data)
numeric_finalized <- finalized_data %>%
dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_finalized, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 90, tl.cex = 0.6)
find_highly_correlated <- function(cor_matrix, threshold = 0.8) {
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA
cor_matrix <- as.data.frame(as.table(cor_matrix))
cor_matrix <- na.omit(cor_matrix)
cor_matrix <- cor_matrix[order(-abs(cor_matrix$Freq)), ]
cor_matrix <- cor_matrix %>% filter(abs(Freq) > threshold)
return(cor_matrix)
}
highly_correlated_pairs <- find_highly_correlated(cor_matrix, threshold = 0.50)
highly_correlated_pairs
set.seed(101)
# Splitting data into training and test sets
train_indices <- sample(seq_len(nrow(finalized_data)), size = floor(0.7 * nrow(finalized_data)))
test_indices <- setdiff(seq_len(nrow(finalized_data)), train_indices)
# Creating training and test datasets
train_data <- finalized_data[train_indices, ]
test_data <- finalized_data[test_indices, ]
# Separating predictors and outcome variable
x_train <- model.matrix(~ . + 0, data = train_data[ , !names(train_data) %in% "hs_zbmi_who"])
x_test <- model.matrix(~ . + 0, data = test_data[ , !names(test_data) %in% "hs_zbmi_who"])
y_train <- train_data$hs_zbmi_who
y_test <- test_data$hs_zbmi_who
covariates_selected <- c("hs_child_age_None", "h_cohort", "e3_sex_None", "e3_yearbir_None")
#to freeze the covariates and make sure they are not shrinked
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
covariates_selected <- c("hs_child_age_None", "h_cohort", "e3_sex_None", "e3_yearbir_None")
baseline_data <- finalized_data %>% dplyr::select(c(covariates_selected, "hs_zbmi_who"))
x <- model.matrix(~ . -1, data = baseline_data[ , !names(baseline_data) %in% "hs_zbmi_who"])
y <- baseline_data$hs_zbmi_who
set.seed(101)
trainIndex <- createDataPartition(y, p = .7, list = FALSE, times = 1)
x_train <- x[trainIndex, ]
y_train <- y[trainIndex]
x_test <- x[-trainIndex, ]
y_test <- y[-trainIndex]
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
set.seed(101)
fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)
coef(fit_lasso)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## h_cohort1 .
## h_cohort2 .
## h_cohort3 .
## h_cohort4 .
## h_cohort5 .
## h_cohort6 .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)
cat("Baseline Lasso MSE:", mse_lasso, "\n")
## Baseline Lasso MSE: 1.252768
cat("Baseline Lasso RMSE:", rmse_lasso, "\n")
## Baseline Lasso RMSE: 1.119271
set.seed(101)
fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)
coef(fit_ridge)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 8.732635e-01
## hs_child_age_None -5.738585e-02
## h_cohort1 -3.471482e-37
## h_cohort2 -9.186643e-38
## h_cohort3 4.060919e-37
## h_cohort4 1.233393e-37
## h_cohort5 -2.947305e-37
## h_cohort6 1.969905e-37
## e3_sex_Nonemale 3.249059e-38
## e3_yearbir_None2004 -1.529129e-37
## e3_yearbir_None2005 2.825661e-38
## e3_yearbir_None2006 -9.749392e-39
## e3_yearbir_None2007 -5.688503e-38
## e3_yearbir_None2008 3.269518e-38
## e3_yearbir_None2009 3.782484e-37
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)
cat("Baseline Ridge MSE:", mse_ridge, "\n")
## Baseline Ridge MSE: 1.242846
cat("Baseline Ridge RMSE:", rmse_ridge, "\n")
## Baseline Ridge RMSE: 1.11483
set.seed(101)
fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)
coef(fit_enet)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## h_cohort1 .
## h_cohort2 .
## h_cohort3 .
## h_cohort4 .
## h_cohort5 .
## h_cohort6 .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)
cat("Baseline Elastic Net MSE:", mse_enet, "\n")
## Baseline Elastic Net MSE: 1.253518
cat("Baseline Elastic Net RMSE:", rmse_enet, "\n")
## Baseline Elastic Net RMSE: 1.119606
set.seed(101)
trainIndex <- createDataPartition(baseline_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- baseline_data[trainIndex, ]
test_data <- baseline_data[-trainIndex, ]
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)
cat("Baseline Decision Tree MSE:", mse_tree, "\n")
## Baseline Decision Tree MSE: 1.251346
cat("Baseline Decision Tree RMSE:", rmse_tree, "\n")
## Baseline Decision Tree RMSE: 1.118636
The decision tree indicates the importance of covariates, with cohort being significant predictors.
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500, importance = TRUE)
varImpPlot(fit_rf)
rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((test_data$hs_zbmi_who - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Baseline Random Forest MSE:", mse_rf, "\n")
## Baseline Random Forest MSE: 1.285745
cat("Baseline Random Forest RMSE:", rmse_rf, "\n")
## Baseline Random Forest RMSE: 1.133907
Importance plots show that age and cohort are the most influential variables.
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5, verbose = FALSE)
summary(fit_gbm)
best_trees <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_trees)
mse_gbm <- mean((test_data$hs_zbmi_who - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Baseline GBM MSE:", mse_gbm, "\n")
## Baseline GBM MSE: 1.26256
cat("Baseline GBM RMSE:", rmse_gbm, "\n")
## Baseline GBM RMSE: 1.123637
Model Comparison:
The results indicate that Ridge Regression performs slightly better than Lasso and Elastic Net in terms of MSE and RMSE.
The Decision Tree model also performs comparably to the regularization methods with an MSE of 1.251 and RMSE of 1.119. Decision Trees are advantageous for their interpretability, providing clear decision rules based on the covariates.
The Random Forest model shows a slightly higher MSE (1.286) and RMSE (1.134) compared to the regularization methods and Decision Tree. Despite the slightly higher error metrics, Random Forest models are powerful in capturing complex interactions between variables due to the ensemble approach of multiple decision trees.
GBM performs better than Random Forest with an MSE of 1.262 and RMSE of 1.124, indicating its effectiveness in reducing error through boosting iterations.
Age and cohort are consistently the most important variables across different models. Regularized regression models (Ridge and Elastic Net) provide a balance between prediction accuracy and model interpretability.
diet_selected <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
covariates_selected <- c("hs_child_age_None", "h_cohort", "e3_sex_None", "e3_yearbir_None")
diet_data <- finalized_data %>% dplyr::select(c(covariates_selected, diet_selected, "hs_zbmi_who"))
set.seed(101)
trainIndex <- createDataPartition(diet_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- diet_data[trainIndex, ]
test_data <- diet_data[-trainIndex, ]
x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)
coef(fit_lasso)
## 39 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## h_cohort1 .
## h_cohort2 .
## h_cohort3 .
## h_cohort4 .
## h_cohort5 .
## h_cohort6 .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## h_bfdur_Ter(10.8,34.9] .
## h_bfdur_Ter(34.9,Inf] .
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] .
## hs_dairy_Ter(14.6,25.6] .
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] .
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] .
## hs_proc_meat_Ter(1.5,4] .
## hs_proc_meat_Ter(4,Inf] .
## hs_total_fish_Ter(1.5,3] .
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] .
## hs_total_fruits_Ter(14.1,Inf] .
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] .
## hs_total_sweets_Ter(4.1,8.5] .
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)
cat("Diet + Covariates Lasso MSE:", mse_lasso, "\n")
## Diet + Covariates Lasso MSE: 1.254204
cat("Diet + Covariates Lasso RMSE:", rmse_lasso, "\n")
## Diet + Covariates Lasso RMSE: 1.119912
fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)
coef(fit_ridge)
## 39 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 8.732635e-01
## hs_child_age_None -5.738585e-02
## h_cohort1 -3.640137e-37
## h_cohort2 -9.632957e-38
## h_cohort3 4.258211e-37
## h_cohort4 1.293315e-37
## h_cohort5 -3.090494e-37
## h_cohort6 2.065609e-37
## e3_sex_Nonemale 3.406909e-38
## e3_yearbir_None2004 -1.603419e-37
## e3_yearbir_None2005 2.962941e-38
## e3_yearbir_None2006 -1.022305e-38
## e3_yearbir_None2007 -5.964868e-38
## e3_yearbir_None2008 3.428361e-38
## e3_yearbir_None2009 3.966249e-37
## h_bfdur_Ter(10.8,34.9] 1.967500e-37
## h_bfdur_Ter(34.9,Inf] -1.201068e-37
## hs_bakery_prod_Ter(2,6] 2.111406e-37
## hs_bakery_prod_Ter(6,Inf] -1.904476e-37
## hs_break_cer_Ter(1.1,5.5] 9.841445e-39
## hs_break_cer_Ter(5.5,Inf] -2.296259e-37
## hs_dairy_Ter(14.6,25.6] 5.657332e-38
## hs_dairy_Ter(25.6,Inf] 8.250493e-41
## hs_fastfood_Ter(0.132,0.5] 1.220259e-37
## hs_fastfood_Ter(0.5,Inf] -6.260868e-38
## hs_org_food_Ter(0.132,1] 8.179714e-38
## hs_org_food_Ter(1,Inf] -1.908079e-37
## hs_proc_meat_Ter(1.5,4] 1.198077e-37
## hs_proc_meat_Ter(4,Inf] -2.002609e-38
## hs_total_fish_Ter(1.5,3] -8.939327e-38
## hs_total_fish_Ter(3,Inf] 2.150559e-38
## hs_total_fruits_Ter(7,14.1] 2.176951e-37
## hs_total_fruits_Ter(14.1,Inf] -2.048991e-37
## hs_total_lipids_Ter(3,7] -9.208028e-39
## hs_total_lipids_Ter(7,Inf] -2.162334e-37
## hs_total_sweets_Ter(4.1,8.5] -7.709333e-38
## hs_total_sweets_Ter(8.5,Inf] -1.723753e-38
## hs_total_veg_Ter(6,8.5] 2.140950e-38
## hs_total_veg_Ter(8.5,Inf] -1.287878e-37
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)
cat("Diet + Covariates Ridge MSE:", mse_ridge, "\n")
## Diet + Covariates Ridge MSE: 1.248868
cat("Diet + Covariates Ridge RMSE:", rmse_ridge, "\n")
## Diet + Covariates Ridge RMSE: 1.117528
fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)
coef(fit_enet)
## 39 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## h_cohort1 .
## h_cohort2 .
## h_cohort3 .
## h_cohort4 .
## h_cohort5 .
## h_cohort6 .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## h_bfdur_Ter(10.8,34.9] .
## h_bfdur_Ter(34.9,Inf] .
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] .
## hs_dairy_Ter(14.6,25.6] .
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] .
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] .
## hs_proc_meat_Ter(1.5,4] .
## hs_proc_meat_Ter(4,Inf] .
## hs_total_fish_Ter(1.5,3] .
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] .
## hs_total_fruits_Ter(14.1,Inf] .
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] .
## hs_total_sweets_Ter(4.1,8.5] .
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)
cat("Diet + Covariates Elastic Net MSE:", mse_enet, "\n")
## Diet + Covariates Elastic Net MSE: 1.25324
cat("Diet + Covariates Elastic Net RMSE:", rmse_enet, "\n")
## Diet + Covariates Elastic Net RMSE: 1.119482
Ridge Regression continues to perform well with the addition of diet variables, maintaining its position as one of the best regularization techniques.
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)
cat("Diet + Covariates Decision Tree MSE:", mse_tree, "\n")
## Diet + Covariates Decision Tree MSE: 1.249936
cat("Diet + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Diet + Covariates Decision Tree RMSE: 1.118005
The top predictor is h_cohort3, followed by hs_child_age_None and hs_bakery_prod_Ter. The Decision Tree model slightly improves when adding diet variables, with a decrease in both MSE (from 1.251 to 1.250) and RMSE (from 1.119 to 1.118).
set.seed(101)
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data)
varImpPlot(fit_rf)
importance(fit_rf)
## IncNodePurity
## hs_child_age_None 208.99850
## h_cohort 94.27094
## e3_sex_None 33.36319
## e3_yearbir_None 80.56006
## h_bfdur_Ter 57.63270
## hs_bakery_prod_Ter 64.57253
## hs_break_cer_Ter 63.66238
## hs_dairy_Ter 64.81974
## hs_fastfood_Ter 54.23841
## hs_org_food_Ter 58.37291
## hs_proc_meat_Ter 63.18441
## hs_total_fish_Ter 59.59955
## hs_total_fruits_Ter 60.14126
## hs_total_lipids_Ter 59.03292
## hs_total_sweets_Ter 64.33113
## hs_total_veg_Ter 64.78676
rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((test_data$hs_zbmi_who - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Diet + Covariates Random Forest MSE:", mse_rf, "\n")
## Diet + Covariates Random Forest MSE: 1.282615
cat("Diet + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Covariates Random Forest RMSE: 1.132526
The Random Forest model shows a slight improvement when diet variables are included, with MSE decreasing from 1.286 to 1.283 and RMSE from 1.134 to 1.133.
set.seed(101)
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5, verbose = FALSE)
summary(fit_gbm)
best_trees <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_trees)
mse_gbm <- mean((test_data$hs_zbmi_who - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Diet + Covariates GBM MSE:", mse_gbm, "\n")
## Diet + Covariates GBM MSE: 1.252086
cat("Diet + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Covariates GBM RMSE: 1.118967
GBM shows a significant improvement with diet variables, indicating its effectiveness in leveraging additional predictors for better performance.
Model Comparison:
Both Decision Tree and Random Forest models benefit from the inclusion of diet variables, although the improvements are modest. These models can capture complex interactions between diet and covariates.These findings suggest that dietary factors, in conjunction with covariates, enhance the predictive power of the models, justifying their inclusion in more complex modeling efforts.
chemicals_selected <- c(
"hs_cd_c_Log2", "hs_co_c_Log2", "hs_cs_c_Log2", "hs_cu_c_Log2",
"hs_hg_c_Log2", "hs_mo_c_Log2", "hs_pb_c_Log2", "hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2", "hs_pcb170_cadj_Log2", "hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2", "hs_pfhxs_c_Log2", "hs_pfoa_c_Log2",
"hs_pfos_c_Log2", "hs_prpa_cadj_Log2", "hs_mbzp_cadj_Log2", "hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
chemical_data <- finalized_data %>% dplyr::select(c(covariates_selected, chemicals_selected, "hs_zbmi_who"))
set.seed(101)
trainIndex <- createDataPartition(chemical_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- chemical_data[trainIndex, ]
test_data <- chemical_data[-trainIndex, ]
x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)
coef(fit_lasso)
## 34 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -0.22880486
## hs_child_age_None -0.06945348
## h_cohort1 -0.35586244
## h_cohort2 .
## h_cohort3 0.15249341
## h_cohort4 .
## h_cohort5 .
## h_cohort6 .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## hs_cd_c_Log2 .
## hs_co_c_Log2 .
## hs_cs_c_Log2 .
## hs_cu_c_Log2 0.19017840
## hs_hg_c_Log2 .
## hs_mo_c_Log2 -0.03891260
## hs_pb_c_Log2 .
## hs_dde_cadj_Log2 -0.04222117
## hs_pcb153_cadj_Log2 -0.16375278
## hs_pcb170_cadj_Log2 -0.04202673
## hs_dep_cadj_Log2 .
## hs_pbde153_cadj_Log2 -0.02572945
## hs_pfhxs_c_Log2 .
## hs_pfoa_c_Log2 -0.01268486
## hs_pfos_c_Log2 .
## hs_prpa_cadj_Log2 .
## hs_mbzp_cadj_Log2 .
## hs_mibp_cadj_Log2 .
## hs_mnbp_cadj_Log2 .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)
cat("Chemical + Covariates Lasso MSE:", mse_lasso, "\n")
## Chemical + Covariates Lasso MSE: 1.024987
cat("Chemical + Covariates Lasso RMSE:", rmse_lasso, "\n")
## Chemical + Covariates Lasso RMSE: 1.012416
fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)
coef(fit_ridge)
## 34 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.5091761675
## hs_child_age_None -0.0482701048
## h_cohort1 -0.2135450987
## h_cohort2 -0.0811254696
## h_cohort3 0.1431835261
## h_cohort4 0.0960498652
## h_cohort5 -0.0234630806
## h_cohort6 0.0491206037
## e3_sex_Nonemale 0.0372834068
## e3_yearbir_None2004 -0.0970866409
## e3_yearbir_None2005 -0.0052724110
## e3_yearbir_None2006 0.0224854247
## e3_yearbir_None2007 -0.0129379246
## e3_yearbir_None2008 0.0003494858
## e3_yearbir_None2009 0.2186823716
## hs_cd_c_Log2 -0.0296210057
## hs_co_c_Log2 -0.0280149777
## hs_cs_c_Log2 0.0647444145
## hs_cu_c_Log2 0.3115974897
## hs_hg_c_Log2 -0.0140060992
## hs_mo_c_Log2 -0.0639201855
## hs_pb_c_Log2 -0.0378306310
## hs_dde_cadj_Log2 -0.0559819680
## hs_pcb153_cadj_Log2 -0.1241970865
## hs_pcb170_cadj_Log2 -0.0369214916
## hs_dep_cadj_Log2 -0.0092793135
## hs_pbde153_cadj_Log2 -0.0236726456
## hs_pfhxs_c_Log2 -0.0147308317
## hs_pfoa_c_Log2 -0.0837103171
## hs_pfos_c_Log2 -0.0202621556
## hs_prpa_cadj_Log2 0.0009023231
## hs_mbzp_cadj_Log2 0.0266741033
## hs_mibp_cadj_Log2 -0.0283116127
## hs_mnbp_cadj_Log2 -0.0354752569
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)
cat("Chemical + Covariates Ridge MSE:", mse_ridge, "\n")
## Chemical + Covariates Ridge MSE: 1.022941
cat("Chemical + Covariates Ridge RMSE:", rmse_ridge, "\n")
## Chemical + Covariates Ridge RMSE: 1.011405
fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)
coef(fit_enet)
## 34 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -0.69372755
## hs_child_age_None -0.07690726
## h_cohort1 -0.38803152
## h_cohort2 .
## h_cohort3 0.17565147
## h_cohort4 .
## h_cohort5 .
## h_cohort6 .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## hs_cd_c_Log2 .
## hs_co_c_Log2 .
## hs_cs_c_Log2 .
## hs_cu_c_Log2 0.24840977
## hs_hg_c_Log2 .
## hs_mo_c_Log2 -0.05153618
## hs_pb_c_Log2 .
## hs_dde_cadj_Log2 -0.05395542
## hs_pcb153_cadj_Log2 -0.16065967
## hs_pcb170_cadj_Log2 -0.04332864
## hs_dep_cadj_Log2 .
## hs_pbde153_cadj_Log2 -0.02668759
## hs_pfhxs_c_Log2 .
## hs_pfoa_c_Log2 -0.03227568
## hs_pfos_c_Log2 .
## hs_prpa_cadj_Log2 .
## hs_mbzp_cadj_Log2 .
## hs_mibp_cadj_Log2 .
## hs_mnbp_cadj_Log2 .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)
cat("Chemical + Covariates Elastic Net MSE:", mse_enet, "\n")
## Chemical + Covariates Elastic Net MSE: 1.025036
cat("Chemical + Covariates Elastic Net RMSE:", rmse_enet, "\n")
## Chemical + Covariates Elastic Net RMSE: 1.012441
Lasso and Ridge Regression: Both models show marked improvements, indicating that chemical variables are highly predictive.
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((y_test - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)
cat("Chemical + Covariates Decision Tree MSE:", mse_tree, "\n")
## Chemical + Covariates Decision Tree MSE: 1.240155
cat("Chemical + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Chemical + Covariates Decision Tree RMSE: 1.113622
The Decision Tree model benefits from chemical variables, though the improvements are more modest compared to the other models.
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500)
rf_predictions <- predict(fit_rf, newdata = test_data)
varImpPlot(fit_rf)
importance(fit_rf)
## IncNodePurity
## hs_child_age_None 43.503081
## h_cohort 55.667023
## e3_sex_None 6.437968
## e3_yearbir_None 28.303839
## hs_cd_c_Log2 50.951297
## hs_co_c_Log2 49.920695
## hs_cs_c_Log2 39.773024
## hs_cu_c_Log2 66.621641
## hs_hg_c_Log2 49.142881
## hs_mo_c_Log2 54.875936
## hs_pb_c_Log2 44.095008
## hs_dde_cadj_Log2 90.587837
## hs_pcb153_cadj_Log2 82.018781
## hs_pcb170_cadj_Log2 134.334067
## hs_dep_cadj_Log2 53.198302
## hs_pbde153_cadj_Log2 90.473791
## hs_pfhxs_c_Log2 41.857770
## hs_pfoa_c_Log2 60.276778
## hs_pfos_c_Log2 46.492002
## hs_prpa_cadj_Log2 49.081556
## hs_mbzp_cadj_Log2 50.574249
## hs_mibp_cadj_Log2 43.570481
## hs_mnbp_cadj_Log2 51.144486
mse_rf <- mean((y_test - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Chemical + Covariates Random Forest MSE:", mse_rf, "\n")
## Chemical + Covariates Random Forest MSE: 1.025661
cat("Chemical + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Chemical + Covariates Random Forest RMSE: 1.012749
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
summary(fit_gbm)
best_iter <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_iter)
mse_gbm <- mean((y_test - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Chemical + Covariates GBM MSE:", mse_gbm, "\n")
## Chemical + Covariates GBM MSE: 1.128855
cat("Chemical + Covariates GBM RMSE:", rmse_gbm, "\n")
## Chemical + Covariates GBM RMSE: 1.062476
The GBM model shows substantial improvement, with MSE decreasing from 1.262 to 1.129 and RMSE from 1.124 to 1.062. The GBM model also benefits significantly from the inclusion of chemical variables, showcasing its effectiveness in leveraging additional predictors.
covariates_selected <- c("hs_child_age_None", "h_cohort", "e3_sex_None", "e3_yearbir_None")
diet_selected <- c(
"h_bfdur_Ter", "hs_bakery_prod_Ter", "hs_break_cer_Ter", "hs_dairy_Ter",
"hs_fastfood_Ter", "hs_org_food_Ter", "hs_proc_meat_Ter", "hs_total_fish_Ter",
"hs_total_fruits_Ter", "hs_total_lipids_Ter", "hs_total_sweets_Ter", "hs_total_veg_Ter"
)
chemicals_selected <- c(
"hs_cd_c_Log2", "hs_co_c_Log2", "hs_cs_c_Log2", "hs_cu_c_Log2",
"hs_hg_c_Log2", "hs_mo_c_Log2", "hs_pb_c_Log2", "hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2", "hs_pcb170_cadj_Log2", "hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2", "hs_pfhxs_c_Log2", "hs_pfoa_c_Log2",
"hs_pfos_c_Log2", "hs_prpa_cadj_Log2", "hs_mbzp_cadj_Log2", "hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
combined_selected <- c(covariates_selected, diet_selected, chemicals_selected)
chemical_diet_cov_data <- finalized_data %>% dplyr::select(all_of(c(combined_selected, "hs_zbmi_who")))
set.seed(101)
trainIndex <- createDataPartition(chemical_diet_cov_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- chemical_diet_cov_data[trainIndex,]
test_data <- chemical_diet_cov_data[-trainIndex,]
x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
num_covariates <- length(covariates_selected)
num_diet <- length(diet_selected)
num_chemicals <- length(chemicals_selected)
# make the group_indices vector
group_indices <- c(
rep(1, num_covariates), # Group 1: Covariates
rep(2, num_diet), # Group 2: Diet
rep(3, num_chemicals) # Group 3: Chemicals
)
# adjust length if needed
if (length(group_indices) < ncol(x_train)) {
group_indices <- c(group_indices, rep(4, ncol(x_train) - length(group_indices)))
}
length(group_indices) == ncol(x_train) # should be TRUE
## [1] TRUE
set.seed(101)
group_lasso_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grLasso",, penalty.factor = penalty_factors, family = "gaussian")
group_lasso_predictions <- predict(group_lasso_model, x_test, type = "response")
mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)
cat("Group Lasso MSE:", mse_group_lasso, "\n")
## Group Lasso MSE: 1.041739
cat("Group Lasso RMSE:", rmse_group_lasso, "\n")
## Group Lasso RMSE: 1.020656
set.seed(101)
group_ridge_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian")
group_ridge_predictions <- predict(group_ridge_model, x_test, type = "response")
mse_group_ridge <- mean((y_test - group_ridge_predictions)^2)
rmse_group_ridge <- sqrt(mse_group_ridge)
cat("Group Ridge MSE:", mse_group_ridge, "\n")
## Group Ridge MSE: 1.036075
cat("Group Ridge RMSE:", rmse_group_ridge, "\n")
## Group Ridge RMSE: 1.017878
set.seed(101)
group_enet_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian")
group_enet_predictions <- predict(group_enet_model, x_test, type = "response")
mse_group_enet <- mean((y_test - group_enet_predictions)^2)
rmse_group_enet <- sqrt(mse_group_enet)
cat("Group Elastic Net MSE:", mse_group_enet, "\n")
## Group Elastic Net MSE: 1.04044
cat("Group Elastic Net RMSE:", rmse_group_enet, "\n")
## Group Elastic Net RMSE: 1.020019
Lasso and Ridge Regression: These models show marked improvements, suggesting that the combined diet and chemical variables significantly enhance predictive power.
set.seed(101)
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((y_test - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)
cat("Diet + Chemical + Covariates Decision Tree MSE:", mse_tree, "\n")
## Diet + Chemical + Covariates Decision Tree MSE: 1.240155
cat("Diet + Chemical + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Diet + Chemical + Covariates Decision Tree RMSE: 1.113622
The Decision Tree model shows an improvement with the inclusion of diet and chemical variables, with MSE remaining at 1.240 and RMSE at 1.114. However, the change is not as pronounced as in regularization methods.
set.seed(101)
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500)
varImpPlot(fit_rf)
importance(fit_rf)
## IncNodePurity
## hs_child_age_None 35.114980
## h_cohort 50.347975
## e3_sex_None 5.813078
## e3_yearbir_None 25.549233
## h_bfdur_Ter 12.815405
## hs_bakery_prod_Ter 19.147220
## hs_break_cer_Ter 12.261038
## hs_dairy_Ter 10.800194
## hs_fastfood_Ter 10.750048
## hs_org_food_Ter 9.449158
## hs_proc_meat_Ter 10.641237
## hs_total_fish_Ter 10.840496
## hs_total_fruits_Ter 13.120566
## hs_total_lipids_Ter 10.943549
## hs_total_sweets_Ter 11.777760
## hs_total_veg_Ter 11.552907
## hs_cd_c_Log2 41.428697
## hs_co_c_Log2 42.742568
## hs_cs_c_Log2 35.785447
## hs_cu_c_Log2 55.939681
## hs_hg_c_Log2 40.514511
## hs_mo_c_Log2 49.756941
## hs_pb_c_Log2 38.766595
## hs_dde_cadj_Log2 80.839631
## hs_pcb153_cadj_Log2 76.590310
## hs_pcb170_cadj_Log2 126.930304
## hs_dep_cadj_Log2 46.803114
## hs_pbde153_cadj_Log2 84.266968
## hs_pfhxs_c_Log2 35.959135
## hs_pfoa_c_Log2 49.980790
## hs_pfos_c_Log2 44.717995
## hs_prpa_cadj_Log2 43.147837
## hs_mbzp_cadj_Log2 44.534878
## hs_mibp_cadj_Log2 34.411750
## hs_mnbp_cadj_Log2 43.971153
rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((y_test - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Diet + Chemical + Covariates Random Forest MSE:", mse_rf, "\n")
## Diet + Chemical + Covariates Random Forest MSE: 1.01944
cat("Diet + Chemical + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Chemical + Covariates Random Forest RMSE: 1.009673
The Random Forest model shows one of the most significant improvements, with MSE dropping from 1.286 to 1.025 and RMSE from 1.134 to 1.012.
set.seed(101)
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
summary(fit_gbm)
best_iter <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_iter)
mse_gbm <- mean((y_test - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Diet + Chemical + Covariates GBM MSE:", mse_gbm, "\n")
## Diet + Chemical + Covariates GBM MSE: 1.125755
cat("Diet + Chemical + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Chemical + Covariates GBM RMSE: 1.061016
The GBM model also shows significant improvement, with MSE decreasing from 1.262 to 1.128 and RMSE from 1.124 to 1.062.
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/metabol_serum.RData")
metabol_serum_transposed <- as.data.frame(t(metabol_serum.d))
metabol_serum_transposed$ID <- as.integer(rownames(metabol_serum_transposed))
# add the ID column to the first position
metabol_serum_transposed <- metabol_serum_transposed[, c("ID", setdiff(names(metabol_serum_transposed), "ID"))]
# ID is the first column, and the layout is preserved
kable(head(metabol_serum_transposed), align = "c", digits = 2, format = "pipe")
| ID | metab_1 | metab_2 | metab_3 | metab_4 | metab_5 | metab_6 | metab_7 | metab_8 | metab_9 | metab_10 | metab_11 | metab_12 | metab_13 | metab_14 | metab_15 | metab_16 | metab_17 | metab_18 | metab_19 | metab_20 | metab_21 | metab_22 | metab_23 | metab_24 | metab_25 | metab_26 | metab_27 | metab_28 | metab_29 | metab_30 | metab_31 | metab_32 | metab_33 | metab_34 | metab_35 | metab_36 | metab_37 | metab_38 | metab_39 | metab_40 | metab_41 | metab_42 | metab_43 | metab_44 | metab_45 | metab_46 | metab_47 | metab_48 | metab_49 | metab_50 | metab_51 | metab_52 | metab_53 | metab_54 | metab_55 | metab_56 | metab_57 | metab_58 | metab_59 | metab_60 | metab_61 | metab_62 | metab_63 | metab_64 | metab_65 | metab_66 | metab_67 | metab_68 | metab_69 | metab_70 | metab_71 | metab_72 | metab_73 | metab_74 | metab_75 | metab_76 | metab_77 | metab_78 | metab_79 | metab_80 | metab_81 | metab_82 | metab_83 | metab_84 | metab_85 | metab_86 | metab_87 | metab_88 | metab_89 | metab_90 | metab_91 | metab_92 | metab_93 | metab_94 | metab_95 | metab_96 | metab_97 | metab_98 | metab_99 | metab_100 | metab_101 | metab_102 | metab_103 | metab_104 | metab_105 | metab_106 | metab_107 | metab_108 | metab_109 | metab_110 | metab_111 | metab_112 | metab_113 | metab_114 | metab_115 | metab_116 | metab_117 | metab_118 | metab_119 | metab_120 | metab_121 | metab_122 | metab_123 | metab_124 | metab_125 | metab_126 | metab_127 | metab_128 | metab_129 | metab_130 | metab_131 | metab_132 | metab_133 | metab_134 | metab_135 | metab_136 | metab_137 | metab_138 | metab_139 | metab_140 | metab_141 | metab_142 | metab_143 | metab_144 | metab_145 | metab_146 | metab_147 | metab_148 | metab_149 | metab_150 | metab_151 | metab_152 | metab_153 | metab_154 | metab_155 | metab_156 | metab_157 | metab_158 | metab_159 | metab_160 | metab_161 | metab_162 | metab_163 | metab_164 | metab_165 | metab_166 | metab_167 | metab_168 | metab_169 | metab_170 | metab_171 | metab_172 | metab_173 | metab_174 | metab_175 | metab_176 | metab_177 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 430 | 430 | -2.15 | -0.71 | 8.60 | 0.55 | 7.05 | 5.79 | 3.75 | 5.07 | -1.87 | -2.77 | -3.31 | -2.91 | -2.94 | -1.82 | -4.40 | -4.10 | -5.41 | -5.13 | -5.35 | -3.39 | -5.08 | -6.06 | -6.06 | -4.99 | -4.46 | -4.63 | -3.27 | -4.61 | 2.17 | -1.73 | -4.97 | -4.90 | -2.63 | -5.29 | -2.38 | -4.06 | -5.11 | -5.35 | -4.80 | -3.92 | -3.92 | -5.47 | -4.22 | -2.56 | -3.93 | 5.15 | 6.03 | 10.20 | 5.14 | 7.82 | 12.31 | 7.27 | 7.08 | 1.79 | 7.73 | 7.98 | 1.96 | 6.15 | 0.98 | 0.60 | 4.42 | 4.36 | 5.85 | 1.03 | 2.74 | -2.53 | -2.05 | -2.91 | -1.61 | -1.63 | 5.03 | 0.14 | 6.23 | -2.95 | 1.29 | 1.70 | -2.83 | 4.55 | 4.05 | 2.56 | -0.29 | 8.33 | 9.93 | 4.89 | 1.28 | 2.16 | 5.82 | 8.95 | 7.72 | 8.41 | 4.71 | 0.10 | 2.02 | 0.16 | 5.82 | 7.45 | 6.17 | 6.81 | -0.70 | -1.25 | -0.65 | 2.05 | 3.39 | 4.94 | -0.69 | -1.44 | -2.06 | -2.44 | -1.30 | -0.73 | -1.52 | -2.43 | -3.26 | 1.97 | 0.03 | 1.09 | 3.98 | 4.56 | 4.16 | 0.42 | 3.48 | 4.88 | 3.84 | 4.70 | 4.04 | 1.58 | -0.76 | 1.75 | 2.48 | 4.43 | 4.68 | 3.29 | 0.97 | 1.03 | 0.44 | 1.55 | 2.26 | 2.72 | 0.12 | -0.90 | -0.50 | 0.02 | -0.18 | 1.02 | -2.69 | -1.66 | 0.47 | 0.28 | 6.75 | 7.67 | -2.66 | -1.52 | 7.28 | -0.08 | 2.39 | 1.55 | 3.01 | 2.92 | -0.48 | 6.78 | 3.90 | 4.05 | 3.17 | -1.46 | 3.56 | 4.60 | -3.55 | -2.79 | -1.98 | -1.84 | 3.98 | 6.47 | 7.16 | -0.01 | 6.57 | 6.86 | 8.36 |
| 1187 | 1187 | -0.69 | -0.37 | 9.15 | -1.33 | 6.89 | 5.81 | 4.26 | 5.08 | -2.30 | -3.42 | -3.63 | -3.16 | -3.22 | -1.57 | -4.10 | -5.35 | -5.68 | -6.11 | -5.54 | -3.50 | -5.24 | -5.72 | -5.97 | -4.94 | -4.25 | -4.46 | -3.55 | -4.64 | 1.81 | -2.92 | -4.44 | -4.49 | -3.53 | -4.94 | -3.15 | -4.13 | -4.47 | -4.90 | -4.24 | -3.49 | -3.94 | -4.99 | -4.02 | -2.69 | -3.69 | 5.13 | 5.57 | 9.93 | 6.13 | 8.47 | 12.32 | 6.83 | 5.94 | 1.64 | 6.82 | 7.74 | 1.98 | 6.11 | 0.99 | 0.19 | 4.34 | 4.36 | 5.47 | 0.92 | 2.69 | -2.69 | -1.93 | -2.79 | -1.63 | -1.69 | 4.58 | 0.41 | 6.14 | -3.06 | 1.05 | 2.10 | -2.95 | 4.51 | 4.30 | 2.57 | 0.08 | 8.27 | 9.54 | 4.61 | 1.39 | 1.91 | 5.91 | 8.59 | 7.34 | 8.04 | 4.29 | -0.04 | 2.17 | 0.42 | 5.39 | 6.95 | 5.68 | 6.09 | -0.68 | -1.29 | -0.76 | 1.84 | 3.06 | 4.40 | -0.52 | -1.52 | -1.90 | -2.44 | -1.46 | -1.00 | -1.33 | -2.41 | -3.67 | 2.48 | 0.27 | 1.02 | 4.19 | 4.43 | 4.19 | 0.33 | 3.24 | 4.38 | 3.92 | 5.09 | 4.42 | 1.01 | -0.53 | 1.36 | 2.25 | 4.54 | 5.10 | 3.45 | 0.65 | 0.83 | 0.36 | 1.68 | 2.56 | 2.70 | 0.02 | -1.02 | -0.93 | -0.22 | 0.11 | 1.60 | -2.70 | -1.31 | 1.08 | 0.54 | 6.29 | 7.97 | -3.22 | -1.34 | 7.50 | 0.48 | 2.19 | 1.49 | 3.09 | 2.71 | -0.38 | 6.86 | 3.77 | 4.31 | 3.23 | -1.82 | 3.80 | 5.05 | -3.31 | -2.18 | -2.21 | -2.01 | 4.91 | 6.84 | 7.14 | 0.14 | 6.03 | 6.55 | 7.91 |
| 940 | 940 | -0.69 | -0.36 | 8.95 | -0.13 | 7.10 | 5.86 | 4.35 | 5.92 | -1.97 | -3.40 | -3.41 | -2.99 | -3.01 | -1.65 | -3.55 | -4.82 | -5.41 | -5.84 | -5.13 | -2.83 | -4.86 | -5.51 | -5.51 | -4.63 | -3.73 | -4.00 | -2.92 | -4.21 | 2.79 | -1.41 | -4.80 | -5.47 | -2.10 | -5.47 | -2.14 | -4.18 | -4.84 | -5.24 | -4.64 | -3.20 | -3.90 | -5.24 | -3.77 | -2.70 | -2.76 | 5.21 | 5.86 | 9.78 | 6.38 | 8.29 | 12.49 | 7.01 | 6.49 | 1.97 | 7.17 | 7.62 | 2.40 | 6.93 | 1.85 | 1.45 | 5.11 | 5.30 | 6.27 | 2.35 | 3.31 | -2.50 | -1.41 | -2.61 | -0.93 | -1.03 | 4.54 | 1.59 | 6.03 | -2.74 | 1.79 | 2.68 | -8.16 | 5.19 | 5.14 | 3.16 | 0.24 | 9.09 | 10.25 | 5.44 | 1.90 | 2.46 | 6.66 | 9.19 | 8.24 | 8.46 | 5.73 | 1.10 | 2.58 | 1.15 | 6.37 | 7.28 | 6.51 | 7.20 | -0.48 | -0.69 | -0.02 | 2.56 | 3.76 | 5.33 | -0.16 | -1.18 | -1.18 | -2.16 | -1.06 | -0.19 | -0.48 | -2.35 | -3.16 | 2.79 | 0.72 | 2.14 | 4.80 | 4.84 | 4.55 | 1.27 | 4.26 | 5.23 | 4.40 | 5.43 | 4.56 | 2.32 | 0.03 | 2.15 | 3.22 | 5.06 | 5.28 | 3.80 | 1.38 | 1.58 | 0.98 | 2.27 | 2.94 | 3.39 | 0.33 | -0.53 | 0.17 | 0.53 | 0.57 | 1.69 | -2.21 | -0.76 | 1.25 | 0.49 | 6.49 | 8.84 | -4.02 | -1.33 | 7.42 | 0.71 | 2.81 | 2.03 | 3.30 | 3.00 | -0.24 | 7.02 | 3.82 | 4.66 | 3.36 | -1.18 | 3.82 | 4.91 | -2.95 | -2.89 | -2.43 | -2.05 | 4.25 | 7.02 | 7.36 | 0.14 | 6.57 | 6.68 | 8.12 |
| 936 | 936 | -0.19 | -0.34 | 8.54 | -0.62 | 7.01 | 5.95 | 4.24 | 5.41 | -1.89 | -2.84 | -3.38 | -3.11 | -2.94 | -1.45 | -3.83 | -4.43 | -5.61 | -5.41 | -5.54 | -2.94 | -4.78 | -6.06 | -5.88 | -4.70 | -4.82 | -4.46 | -2.66 | -3.82 | 2.85 | -2.70 | -5.16 | -5.47 | -3.31 | -5.61 | -2.80 | -4.11 | -4.97 | -4.86 | -5.01 | -3.63 | -3.78 | -5.29 | -4.17 | -2.49 | -3.65 | 5.31 | 5.60 | 9.87 | 6.67 | 8.05 | 12.33 | 6.72 | 6.42 | 1.25 | 7.28 | 7.37 | 1.99 | 6.28 | 1.17 | 0.50 | 4.52 | 4.43 | 5.54 | 1.30 | 3.08 | -2.92 | -2.16 | -3.18 | -1.66 | -1.63 | 4.55 | 0.53 | 5.73 | -3.27 | 1.30 | 1.70 | -2.57 | 4.53 | 4.14 | 2.61 | -0.18 | 8.32 | 9.62 | 4.82 | 1.58 | 1.99 | 5.82 | 8.59 | 7.58 | 8.39 | 4.68 | 0.36 | 2.01 | -0.31 | 5.71 | 7.35 | 6.22 | 6.66 | -0.70 | -1.42 | -0.62 | 2.13 | 3.54 | 4.85 | -0.72 | -1.53 | -2.04 | -2.37 | -1.38 | -0.96 | -1.57 | -2.91 | -3.60 | 2.37 | 0.21 | 0.92 | 4.05 | 4.27 | 4.33 | 0.24 | 3.38 | 4.45 | 3.71 | 4.74 | 4.44 | 1.51 | -1.73 | 1.51 | 2.27 | 4.37 | 4.89 | 3.40 | 0.66 | 0.83 | 0.27 | 1.50 | 2.30 | 2.60 | 0.14 | -0.90 | -0.99 | -0.53 | -0.30 | 1.14 | -3.06 | -1.69 | 0.39 | 0.19 | 6.21 | 8.05 | -2.75 | -0.87 | 7.79 | 0.87 | 2.48 | 1.62 | 3.28 | 2.93 | -0.41 | 6.91 | 3.75 | 4.38 | 3.20 | -1.07 | 3.81 | 4.89 | -3.36 | -2.40 | -2.06 | -2.03 | 3.99 | 7.36 | 6.94 | 0.14 | 6.26 | 6.47 | 7.98 |
| 788 | 788 | -1.96 | -0.35 | 8.73 | -0.80 | 6.90 | 5.95 | 4.88 | 5.39 | -1.55 | -2.45 | -3.51 | -2.84 | -2.83 | -1.71 | -3.91 | -4.05 | -5.61 | -4.63 | -5.29 | -3.51 | -4.86 | -5.97 | -5.27 | -4.90 | -4.40 | -4.63 | -3.11 | -3.99 | 2.87 | -2.23 | -4.61 | -5.04 | -3.53 | -5.08 | -3.02 | -4.41 | -4.72 | -5.18 | -4.72 | -3.63 | -3.61 | -5.29 | -4.05 | -2.31 | -3.73 | 4.69 | 5.31 | 9.69 | 6.76 | 8.21 | 12.18 | 6.75 | 6.51 | 1.15 | 7.38 | 7.93 | 1.76 | 5.68 | -0.02 | -0.65 | 4.14 | 3.36 | 4.43 | 0.21 | 1.98 | -2.31 | -1.54 | -2.30 | -1.66 | -1.47 | 4.48 | 0.88 | 6.47 | -2.50 | 0.74 | 1.12 | -2.17 | 4.31 | 3.50 | 2.09 | -0.60 | 8.06 | 9.69 | 3.99 | 0.54 | 1.60 | 5.60 | 8.71 | 7.32 | 8.03 | 3.27 | -0.98 | 1.59 | -0.20 | 5.68 | 7.16 | 5.57 | 6.16 | -0.79 | -1.31 | -0.87 | 2.17 | 3.23 | 4.57 | -0.93 | -1.80 | -2.27 | -2.51 | -1.74 | -1.02 | -1.92 | -2.02 | -3.79 | 1.95 | -0.24 | 0.40 | 3.73 | 4.13 | 3.71 | 0.03 | 2.89 | 4.06 | 3.54 | 4.76 | 3.88 | 0.53 | -2.11 | 1.27 | 1.99 | 4.13 | 4.58 | 2.88 | 0.22 | 0.39 | 0.22 | 1.44 | 2.02 | 2.22 | 0.00 | -0.81 | -1.10 | -0.41 | -0.09 | 1.00 | -2.66 | -1.55 | 0.33 | 0.19 | 6.47 | 7.89 | -4.40 | -1.94 | 7.65 | 0.38 | 1.66 | 0.84 | 2.78 | 2.26 | -0.84 | 6.52 | 3.53 | 3.81 | 2.83 | -1.69 | 3.65 | 4.47 | -3.81 | -2.97 | -2.88 | -2.29 | 3.88 | 6.99 | 7.38 | -0.10 | 6.00 | 6.52 | 8.04 |
| 698 | 698 | -1.90 | -0.63 | 8.24 | -0.46 | 6.94 | 5.42 | 4.70 | 4.62 | -1.78 | -3.14 | -3.46 | -2.90 | -2.94 | -1.65 | -4.20 | -4.56 | -5.68 | -5.61 | -5.41 | -2.92 | -5.04 | -5.97 | -6.06 | -4.90 | -4.22 | -4.20 | -3.05 | -4.61 | 2.15 | -2.87 | -4.68 | -5.08 | -3.69 | -5.24 | -3.63 | -4.24 | -5.16 | -5.35 | -4.97 | -3.61 | -3.99 | -5.35 | -3.98 | -2.59 | -3.95 | 5.15 | 5.82 | 10.00 | 5.54 | 8.15 | 12.28 | 6.80 | 6.23 | 1.88 | 7.07 | 7.38 | 2.06 | 6.79 | 1.67 | 1.00 | 4.79 | 4.79 | 5.71 | 1.99 | 3.29 | -2.13 | -1.01 | -1.85 | -1.23 | -0.90 | 4.41 | -0.02 | 6.09 | -2.10 | 1.66 | 2.27 | -3.48 | 4.96 | 4.76 | 2.64 | 0.05 | 8.91 | 9.99 | 5.16 | 1.53 | 2.11 | 6.28 | 8.77 | 8.03 | 8.66 | 5.99 | 0.87 | 2.30 | 0.63 | 6.23 | 7.50 | 6.75 | 7.22 | -0.45 | -0.81 | -0.11 | 2.57 | 3.93 | 5.16 | -0.31 | -1.19 | -1.25 | -1.93 | -0.89 | 0.07 | -0.87 | -1.12 | -3.03 | 2.61 | 0.54 | 1.83 | 4.50 | 4.53 | 4.42 | 1.15 | 4.02 | 4.91 | 4.06 | 5.06 | 4.42 | 2.02 | -1.03 | 1.87 | 2.96 | 4.84 | 5.08 | 3.62 | 1.13 | 1.23 | 0.75 | 2.26 | 2.80 | 3.04 | 0.41 | -0.39 | 0.02 | 0.31 | 0.52 | 1.73 | -2.28 | -0.73 | 1.06 | 0.72 | 6.44 | 7.27 | -3.08 | -1.23 | 7.35 | 0.92 | 2.60 | 2.00 | 3.69 | 3.20 | -0.25 | 7.38 | 4.15 | 5.00 | 3.88 | -1.39 | 4.31 | 5.20 | -3.47 | -2.75 | -1.97 | -1.96 | 4.18 | 6.81 | 6.75 | 0.02 | 6.49 | 5.97 | 7.78 |
# specific covariates
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/HELIX.RData")
filtered_chem_diet <- codebook %>%
filter(domain %in% c("Chemicals", "Lifestyles") & period == "Postnatal" & subfamily != "Allergens")
# specific covariates
filtered_covariates <- codebook %>%
filter(domain == "Covariates" &
variable_name %in% c("ID", "e3_sex_None", "e3_yearbir_None", "h_cohort", "hs_child_age_None"))
#specific phenotype variables
filtered_phenotype <- codebook %>%
filter(domain == "Phenotype" &
variable_name %in% c("hs_zbmi_who"))
# combining all necessary variables together
combined_codebook <- bind_rows(filtered_chem_diet, filtered_covariates, filtered_phenotype)
outcome_and_cov <- cbind(covariates, outcome_BMI)
outcome_and_cov <- outcome_and_cov[, !duplicated(colnames(outcome_and_cov))]
outcome_and_cov <- outcome_and_cov %>%
dplyr::select(ID, hs_child_age_None, h_cohort, e3_sex_None, e3_yearbir_None, hs_zbmi_who)
#the full chemicals list
chemicals_specific <- c(
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
#postnatal diet for child
postnatal_diet <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_readymade_Ter",
"hs_total_bread_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_potatoes_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
covariates_selected <- c("hs_child_age_None", "h_cohort", "e3_sex_None", "e3_yearbir_None")
all_columns <- c(chemicals_specific, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))
selected_id_data <- cbind(outcome_and_cov, extracted_exposome)
# ID is the common identifier in both datasets
combined_data <- merge(selected_id_data, metabol_serum_transposed, by = "ID", all = TRUE)
selected_metabolomics_data <- combined_data %>% dplyr::select(-c(ID))
head(selected_metabolomics_data)
selected_metabolomics_data <- selected_metabolomics_data %>% na.omit()
set.seed(101)
trainIndex <- createDataPartition(selected_metabolomics_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
metabol_train_data <- selected_metabolomics_data[trainIndex,]
metabol_test_data <- selected_metabolomics_data[-trainIndex,]
x_train <- model.matrix(hs_zbmi_who ~ . -1, metabol_train_data)
y_train <- metabol_train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, metabol_test_data)
y_test <- metabol_test_data$hs_zbmi_who
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
num_covariates <- length(covariates_selected)
num_diet <- length(diet_selected)
num_chemicals <- length(chemicals_selected)
num_metabolomics <- ncol(metabol_serum_transposed) - 1 # subtract for the ID column
group_indices <- c(
rep(1, num_covariates), # Group 1: Covariates
rep(2, num_diet), # Group 2: Diet
rep(3, num_chemicals), # Group 3: Chemicals
rep(4, num_metabolomics) # Group 4: Metabolomics
)
if (length(group_indices) < ncol(x_train)) {
group_indices <- c(group_indices, rep(5, ncol(x_train) - length(group_indices)))
}
length(group_indices) == ncol(x_train) # This should be TRUE
## [1] TRUE
# Fit group lasso model
set.seed(101)
group_lasso_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian")
group_lasso_predictions <- predict(group_lasso_model, x_test, type = "response")
mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)
cat("Group Lasso Test MSE:", mse_group_lasso, "\n")
## Group Lasso Test MSE: 0.8706314
cat("Group Lasso Test RMSE:", rmse_group_lasso, "\n")
## Group Lasso Test RMSE: 0.9330763
set.seed(101)
group_ridge_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian")
group_ridge_predictions <- predict(group_ridge_model, x_test, type = "response")
mse_group_ridge <- mean((y_test - group_ridge_predictions)^2)
rmse_group_ridge <- sqrt(mse_group_ridge)
cat("Group Ridge MSE:", mse_group_ridge, "\n")
## Group Ridge MSE: 0.8862179
cat("Group Ridge RMSE:", rmse_group_ridge, "\n")
## Group Ridge RMSE: 0.9413914
set.seed(101)
group_enet_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian")
group_enet_predictions <- predict(group_enet_model, x_test, type = "response")
mse_group_enet <- mean((y_test - group_enet_predictions)^2)
rmse_group_enet <- sqrt(mse_group_enet)
cat("Group Elastic Net MSE:", mse_group_enet, "\n")
## Group Elastic Net MSE: 0.8889618
cat("Group Elastic Net RMSE:", rmse_group_enet, "\n")
## Group Elastic Net RMSE: 0.9428477
Lasso Regression: Shows the most significant improvement, demonstrating the value of the combined variables in predictive modeling.
selected_metabolomics_data <- selected_metabolomics_data %>% na.omit()
set.seed(101)
trainIndex <- createDataPartition(selected_metabolomics_data$hs_zbmi_who, p = .7,
list = FALSE,
times = 1)
train_data <- selected_metabolomics_data[ trainIndex,]
test_data <- selected_metabolomics_data[-trainIndex,]
x_train <- model.matrix(hs_zbmi_who ~ ., train_data)[,-1]
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ ., test_data)[,-1]
y_test <- test_data$hs_zbmi_who
set.seed(101)
tree_model <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(tree_model)
tree_predictions <- predict(tree_model, newdata = test_data)
tree_mse <- mean((tree_predictions - y_test)^2)
rmse_tree <- sqrt(tree_mse)
cat("Diet + Chemical + Metabolomic + Covariates Decision Tree MSE:", tree_mse, "\n")
## Diet + Chemical + Metabolomic + Covariates Decision Tree MSE: 1.545318
cat("Diet + Chemical + Metabolomic + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Diet + Chemical + Metabolomic + Covariates Decision Tree RMSE: 1.243108
The Decision Tree model shows a higher MSE of 1.545 and RMSE of 1.243 when all variables are included, indicating it may struggle with the complexity introduced by metabolomic data.
set.seed(101)
rf_model <- randomForest(hs_zbmi_who ~ . , data = train_data, ntree = 500)
importance(rf_model)
## IncNodePurity
## hs_child_age_None 4.3510904
## h_cohort 11.4847925
## e3_sex_None 0.6469470
## e3_yearbir_None 5.0397600
## hs_cd_c_Log2 5.4034834
## hs_co_c_Log2 5.0423993
## hs_cs_c_Log2 4.6266269
## hs_cu_c_Log2 11.3629923
## hs_hg_c_Log2 5.3986631
## hs_mo_c_Log2 10.5561526
## hs_pb_c_Log2 6.0375115
## hs_dde_cadj_Log2 15.2674613
## hs_pcb153_cadj_Log2 43.7859103
## hs_pcb170_cadj_Log2 77.4592518
## hs_dep_cadj_Log2 6.9179021
## hs_pbde153_cadj_Log2 29.2165376
## hs_pfhxs_c_Log2 5.5574508
## hs_pfoa_c_Log2 9.1473793
## hs_pfos_c_Log2 6.2406329
## hs_prpa_cadj_Log2 5.5785583
## hs_mbzp_cadj_Log2 4.8381018
## hs_mibp_cadj_Log2 4.2835618
## hs_mnbp_cadj_Log2 4.0952772
## h_bfdur_Ter 3.0556126
## hs_bakery_prod_Ter 2.8734807
## hs_dairy_Ter 1.1982993
## hs_fastfood_Ter 0.7788861
## hs_org_food_Ter 1.0689860
## hs_readymade_Ter 1.7494233
## hs_total_bread_Ter 1.3698662
## hs_total_fish_Ter 1.3784976
## hs_total_fruits_Ter 1.1543863
## hs_total_lipids_Ter 1.0160675
## hs_total_potatoes_Ter 1.5864111
## hs_total_sweets_Ter 1.0535255
## hs_total_veg_Ter 1.2142266
## metab_1 5.0246178
## metab_2 4.9975168
## metab_3 3.4643572
## metab_4 5.5046649
## metab_5 3.9463992
## metab_6 7.3085806
## metab_7 4.4257574
## metab_8 31.8547065
## metab_9 2.9959180
## metab_10 3.1758405
## metab_11 3.8734754
## metab_12 3.2738085
## metab_13 5.0382976
## metab_14 4.8505501
## metab_15 4.6498518
## metab_16 2.8186537
## metab_17 2.5814905
## metab_18 3.4531828
## metab_19 2.2677950
## metab_20 3.5314013
## metab_21 2.2608049
## metab_22 2.5923937
## metab_23 2.9864977
## metab_24 3.7658293
## metab_25 3.4717235
## metab_26 7.5922897
## metab_27 3.3079110
## metab_28 3.0942746
## metab_29 3.4194697
## metab_30 19.0900288
## metab_31 3.7296241
## metab_32 2.9438450
## metab_33 4.7884692
## metab_34 2.5374169
## metab_35 7.8801757
## metab_36 3.5484238
## metab_37 3.4615654
## metab_38 2.8551929
## metab_39 2.9023861
## metab_40 5.5303117
## metab_41 4.1589018
## metab_42 5.7814479
## metab_43 2.9051396
## metab_44 3.1993167
## metab_45 3.8627595
## metab_46 4.7165866
## metab_47 6.5126371
## metab_48 11.6532289
## metab_49 34.0040406
## metab_50 10.2784818
## metab_51 5.5997592
## metab_52 3.5422826
## metab_53 5.1768364
## metab_54 4.5873675
## metab_55 8.5822696
## metab_56 3.5655906
## metab_57 4.6641997
## metab_58 3.3604162
## metab_59 5.7724297
## metab_60 4.8328602
## metab_61 3.5217162
## metab_62 3.6407068
## metab_63 4.4807353
## metab_64 4.3958412
## metab_65 3.2690407
## metab_66 2.6464596
## metab_67 2.9324104
## metab_68 3.9480657
## metab_69 2.6555957
## metab_70 2.6514940
## metab_71 4.2321834
## metab_72 3.7483895
## metab_73 3.5601435
## metab_74 2.5232306
## metab_75 4.1790361
## metab_76 2.3825409
## metab_77 4.5203448
## metab_78 4.2542324
## metab_79 3.9202219
## metab_80 3.5883304
## metab_81 3.3478113
## metab_82 5.2828197
## metab_83 4.0533988
## metab_84 3.3750068
## metab_85 5.5406981
## metab_86 3.5952497
## metab_87 3.1383387
## metab_88 2.6259304
## metab_89 2.6560929
## metab_90 2.6864011
## metab_91 2.6201838
## metab_92 3.0571047
## metab_93 3.2157707
## metab_94 9.3374896
## metab_95 52.7476025
## metab_96 7.0768415
## metab_97 3.4158131
## metab_98 3.4086182
## metab_99 5.7843825
## metab_100 3.7188542
## metab_101 2.7941056
## metab_102 4.9365328
## metab_103 3.6992223
## metab_104 4.7269214
## metab_105 3.6214074
## metab_106 3.6727061
## metab_107 3.6544399
## metab_108 3.2550617
## metab_109 5.4854345
## metab_110 7.4270442
## metab_111 2.6425116
## metab_112 2.7255616
## metab_113 5.4538156
## metab_114 3.3385618
## metab_115 5.2528540
## metab_116 4.6202248
## metab_117 6.6989808
## metab_118 2.5575078
## metab_119 6.1560378
## metab_120 7.0372080
## metab_121 4.0063548
## metab_122 6.9004492
## metab_123 2.9740618
## metab_124 3.8403649
## metab_125 3.0350595
## metab_126 2.5287840
## metab_127 6.3605217
## metab_128 6.3535024
## metab_129 3.6497741
## metab_130 3.5048690
## metab_131 2.7358289
## metab_132 3.1416537
## metab_133 2.8089910
## metab_134 3.8237787
## metab_135 4.2005101
## metab_136 5.3871803
## metab_137 7.2446208
## metab_138 5.6624417
## metab_139 3.1680018
## metab_140 3.2342640
## metab_141 7.0874046
## metab_142 13.2308029
## metab_143 8.1427795
## metab_144 3.6521039
## metab_145 3.9931242
## metab_146 4.1109424
## metab_147 3.7136688
## metab_148 3.4295763
## metab_149 5.0799668
## metab_150 5.2297917
## metab_151 3.6133544
## metab_152 4.4017111
## metab_153 4.1872432
## metab_154 4.0488667
## metab_155 2.6370643
## metab_156 2.6304464
## metab_157 3.3068935
## metab_158 3.3021198
## metab_159 2.8824300
## metab_160 8.1179141
## metab_161 26.5008653
## metab_162 3.7626315
## metab_163 16.1969248
## metab_164 6.8100777
## metab_165 3.6030471
## metab_166 3.9155699
## metab_167 3.3543066
## metab_168 3.0729228
## metab_169 4.0738976
## metab_170 4.7208531
## metab_171 4.1684339
## metab_172 3.6712673
## metab_173 4.0493143
## metab_174 4.0874861
## metab_175 4.4045373
## metab_176 5.7419058
## metab_177 14.8456884
par(mar = c(6, 14, 4, 4) + 0.1)
varImpPlot(rf_model, cex = 0.6)
rf_predictions <- predict(rf_model, newdata = test_data)
rf_mse <- mean((rf_predictions - y_test)^2)
rmse_rf <- sqrt(rf_mse)
cat(" Diet + Chemical + Metabolomic + Covariates Random Forest MSE:", rf_mse, "\n")
## Diet + Chemical + Metabolomic + Covariates Random Forest MSE: 1.005087
cat("Diet + Chemical + Metabolomic + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Chemical + Metabolomic + Covariates Random Forest RMSE: 1.00254
Continues to perform robustly, though the improvement is not as pronounced as Lasso or GBM.
gbm_model <- gbm(hs_zbmi_who ~ ., data = train_data,
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 3,
n.minobsinnode = 10,
shrinkage = 0.01,
cv.folds = 5,
verbose = TRUE)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.4864 nan 0.0100 0.0033
## 2 1.4807 nan 0.0100 0.0045
## 3 1.4769 nan 0.0100 0.0019
## 4 1.4720 nan 0.0100 0.0037
## 5 1.4653 nan 0.0100 0.0057
## 6 1.4596 nan 0.0100 0.0042
## 7 1.4552 nan 0.0100 0.0030
## 8 1.4516 nan 0.0100 0.0029
## 9 1.4463 nan 0.0100 0.0029
## 10 1.4402 nan 0.0100 0.0056
## 20 1.3955 nan 0.0100 0.0036
## 40 1.3186 nan 0.0100 0.0021
## 60 1.2544 nan 0.0100 0.0011
## 80 1.1915 nan 0.0100 0.0014
## 100 1.1374 nan 0.0100 0.0018
## 120 1.0889 nan 0.0100 0.0009
## 140 1.0444 nan 0.0100 0.0017
## 160 1.0035 nan 0.0100 0.0005
## 180 0.9665 nan 0.0100 0.0006
## 200 0.9327 nan 0.0100 0.0000
## 220 0.9018 nan 0.0100 0.0004
## 240 0.8717 nan 0.0100 0.0002
## 260 0.8433 nan 0.0100 0.0004
## 280 0.8178 nan 0.0100 0.0001
## 300 0.7931 nan 0.0100 0.0004
## 320 0.7704 nan 0.0100 0.0002
## 340 0.7492 nan 0.0100 0.0002
## 360 0.7299 nan 0.0100 0.0005
## 380 0.7107 nan 0.0100 0.0001
## 400 0.6927 nan 0.0100 0.0001
## 420 0.6760 nan 0.0100 0.0002
## 440 0.6604 nan 0.0100 -0.0002
## 460 0.6439 nan 0.0100 -0.0001
## 480 0.6288 nan 0.0100 0.0001
## 500 0.6138 nan 0.0100 -0.0002
## 520 0.6004 nan 0.0100 0.0005
## 540 0.5877 nan 0.0100 0.0002
## 560 0.5757 nan 0.0100 -0.0001
## 580 0.5642 nan 0.0100 -0.0001
## 600 0.5530 nan 0.0100 -0.0000
## 620 0.5416 nan 0.0100 -0.0001
## 640 0.5309 nan 0.0100 -0.0003
## 660 0.5208 nan 0.0100 -0.0001
## 680 0.5108 nan 0.0100 0.0000
## 700 0.5014 nan 0.0100 -0.0001
## 720 0.4919 nan 0.0100 0.0002
## 740 0.4830 nan 0.0100 -0.0001
## 760 0.4737 nan 0.0100 -0.0001
## 780 0.4653 nan 0.0100 -0.0002
## 800 0.4570 nan 0.0100 -0.0002
## 820 0.4492 nan 0.0100 -0.0001
## 840 0.4421 nan 0.0100 -0.0001
## 860 0.4347 nan 0.0100 -0.0000
## 880 0.4274 nan 0.0100 -0.0000
## 900 0.4207 nan 0.0100 -0.0000
## 920 0.4136 nan 0.0100 -0.0000
## 940 0.4068 nan 0.0100 -0.0003
## 960 0.4006 nan 0.0100 -0.0001
## 980 0.3946 nan 0.0100 -0.0003
## 1000 0.3880 nan 0.0100 0.0000
summary(gbm_model)
# finding the best number of trees based on cross-validation
best_trees <- gbm.perf(gbm_model, method = "cv")
predictions_gbm <- predict(gbm_model, test_data, n.trees = best_trees)
mse_gbm <- mean((y_test - predictions_gbm)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Diet + Chemical + Metabolomic + Covariates GBM MSE:", mse_gbm, "\n")
## Diet + Chemical + Metabolomic + Covariates GBM MSE: 0.8890887
cat("Diet + Chemical + Metabolomic + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Chemical + Metabolomic + Covariates GBM RMSE: 0.942915
summary(gbm_model)
Including diet, chemical, and metabolomic variables with the covariates leads to the best improvements in model performance, indicating these additional predictors provide substantial predictive value.
The GBM model shows the best performance among the advanced models, with MSE decreasing to 0.889 and RMSE to 0.943, highlighting its ability to effectively leverage a comprehensive set of predictors.
results <- data.frame(
Model = rep(c("Lasso", "Ridge", "Elastic Net", "Decision Tree", "Random Forest", "GBM"), each = 5),
Data_Set = rep(c("Baseline", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemicals + Covariates", "Full Model"), 6),
MSE = runif(30, min = 0.2, max = 0.5), # Replace with actual MSE values
RMSE = runif(30, min = 0.4, max = 0.7) # Replace with actual RMSE values
)
mse_plot <- ggplot(results, aes(x = Data_Set, y = MSE, fill = Model)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
rmse_plot <- ggplot(results, aes(x = Data_Set, y = RMSE, fill = Model)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(mse_plot)
print(rmse_plot)